RESULTS OBTAINED WITH THE NON-LANGUAGE
GROUP TEST

RUDOLF PINTNER

Teachers College, Columbia University


It is becoming obvious in many recent studies and discussions of
intelligence tests and intelligence testing that we must discriminate
between different kinds of intelligence rather than continue to speak
in a vague way of “general” intelligence. Thorndike has suggested
a three-fold classification of intelligence into abstract, mechanical
and social. Abstract intelligence may be thought of as the ability
to react to symbols of all sorts, whether numbers, words, forms, or
diagrams. Mechanical intelligence is the ability to handle concrete
things, such as automobiles, tools, instruments, doors, locks, stoves,
and the everyday things of life. In some respects it might be better
to call this sort of intelligence concrete rather than mechanical,
to show the contrast between it and the abstract type of intelligence,
and to suggest that it is not limited to mechanical things in the
narrow sense. Social intelligence is the ability to understand men
and women, to understand and react adequately to social situations.


These kinds of intelligence naturally overlap a great deal and many
adjustments to situations in life involve two or more kinds. There is
nothing hard and fast about this division of intelligence into three
kinds, and it would be perfectly feasible to make a division into more
than three kinds.

In testing intelligence of any kind, we may employ verbal or non-verbal
material or a mixture of these. Thus we may ask questions
about concrete situations and social situations, and obtain some measure
of concrete and social intelligence by means of verbal responses,
written or oral. Most of our tests of abstract intelligence have
been largely verbal in character. To the extent that non-verbal 
material has been used, such as numbers, forms or pictures, the instruc- 
tions or directions have been verbal in nature and to this extent the 
test has depended upon the understanding of language. 

Non-verbal tests range all the way from those absolutely depend- 
ent upon the understanding of verbal instructions, in which the under- 
standing of the instructions is a part of the test proper, to tests where 
the verbal instructions are only a minor feature of the test proper, 
and finally to tests where there are no verbal instructions, but where 
gesture and example are sufficient to enable the subject to perform the
test. Most of our kindergarten and first-grade tests are non-verbal in 
type but all of them give the directions in words and an understanding 
of these words really forms an integral part of the test proper. The 
Army Beta Test employs a minimum of words, depending mainly upon 
gesture and example. Only to a slight extent, if at all, does under- 
standing of language affect the test proper. The Stenquist Mechan- 
ical Assembling Tests, the Pintner-Paterson Performance Scale, and 
the Pintner Non-language Test are examples of tests which may be 
given without any dependence upon verbal directions. 

Tests employing non-verbal material have been relatively little 
used as tests of intelligence. They are harder to construct, harder 
to give and score. They do not correlate very highly with verbal 
intelligence tests, and if one accepts the verbal type of test as the 
sole criterion of intelligence, the non-verbal tests must then be con- 
sidered poor instruments for measuring intelligence. If, however, 
we consider them as measuring a different aspect of intelligence 
from that presumably measured by the verbal type of test, then the 
lack of a high correlation between the two types is understandable. 
Gates¹ finds the mean correlation between a great many verbal and
non-verbal tests for one grade to be 0.24. The tests used employed
non-verbal material, but depended upon language for the understanding
of directions. With Stanford MA these same non-verbal
tests correlated 0.16, while verbal group tests correlated 0.47.
Evidently the non-verbal group tests are not as closely related to the 
Stanford Binet Tests as are the verbal tests. 





1 Gates, A. I.: The Correlations of Achievement in School Subjects with
Intelligence Tests. Journal of Educational Psychology, Vol. XIII, Nos. 3, 4, and
5, March-May.
Stenquist¹ also reports correlations between his mechanical tests
and a composite of six intelligence tests, only one being non-verbal in 
character. These correlations range from .64 for his picture test to 
.23 for his assembling test. For a composite of his four mechanical 
tests and a composite of the same six intelligence tests he reports a 
correlation of .21. Evidently there is not a high correlation between 
mechanical intelligence and verbal intelligence within a range of one 
or two grades. 

The present writer’s Non-language Test² is a group test composed
of non-verbal material. Furthermore, the test can be given without 
the use of language. The test does not call for the knowledge of 
mechanical information, as do the Stenquist tests. The Non-language 
Test may be considered as testing a less abstract type of intelligence 
than is tested by the usual verbal group intelligence test. At the same 
time it is not as concrete as the Stenquist Assembling Tests or the 
Pintner-Paterson Performance Scale. In the course of work with 
this test, several correlations with verbal tests have been reported, 
and several have been calculated by the writer. These are set forth 
in Table I. The correlations between scores or MA’s are all positive
and range from .25 to .72. Most of them lie between .30 and .50. The
three highest correlations, .72, .71 and .65 are for groups of adults 
where the range of talent concerned is wide. The two next highest 
coefficients, .50 and .51 are obtained from the two samples showing 
the widest spread in grade, namely Grades IV to VIII and Grades V 
to VIII respectively. When the range in grade is restricted, as in the 
other samples, all the coefficients fall below .45. For single grade 
groups the coefficients fluctuate from .25 to .40. What the true cor- 
relation between the Non-language Test and the usual type of 
verbal test, as typified by the Army Alpha or National, may be, will 
depend upon our view of correlation. For the total population 
including the whole range of talent, the correlation may well be .71 
or .81; for the restricted range of talent usually found within a single 
grade the correlation is probably not more than .3 or .4. Evidently
this concrete type of intelligence tested by this test is positively 
correlated with the abstract type of intelligence, although within the 
restricted range of one grade the correlation is not high. Most of 





1 Stenquist, J. L.: Measurements of Mechanical Ability. Teachers College 
Contributions to Education, No. 130, Teachers College, Columbia University, 1923. 

2 “The Pintner Non-language Test, Manual and Test Blanks.” College Book
Company, Columbus, Ohio.
the groups here reported are made up largely of children from
non-English-speaking homes. In all probability groups composed of
purely English-speaking American children would show slightly 
higher correlations. 
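
The dependence of these coefficients upon the range of talent can be illustrated with the usual correction for restriction of range. The sketch below is an illustration only: it assumes linear regression, equal scatter about the regression line, and selection on one of the two variables. The function name `widen_range`, the starting coefficient of .35, and the ratios of standard deviations are hypothetical values chosen for the example, not figures taken from the groups in Table I.

```python
import math


def widen_range(r_narrow: float, sd_ratio: float) -> float:
    """Estimate the correlation in a wide group from that in a narrow group.

    sd_ratio is (SD in the wide group) / (SD in the narrow group) on the
    variable that was restricted. Hypothetical helper for illustration.
    """
    k = sd_ratio
    return r_narrow * k / math.sqrt(1.0 - r_narrow ** 2 + (r_narrow ** 2) * k ** 2)


# A single-grade coefficient of .35 (illustrative) under widening ranges.
for ratio in (1.0, 2.0, 3.0):
    print(ratio, round(widen_range(0.35, ratio), 2))
```

Under these assumptions a coefficient of the size found within a single grade rises steadily as the spread of the group increases, which is the direction of the contrast between the single-grade samples and the adult samples.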


TABLE I.—CORRELATIONS OF THE PINTNER NON-LANGUAGE TEST WITH VARIOUS
VERBAL GROUP TESTS

                                                                     Correlations
Test                    Reported by   Cases   Grade                  Score    IQ
Army Alpha              Hamill          50    Policemen               .72
Army Alpha              Pintner         25    Tradesmen               .71
[illegible]             Poull          620    V to VIII               .51
National                Poull          188    VI A                    .31
[illegible]             Poull          106    VI A                    .25
National                Pintner         34    IV A                    .40
[illegible]             Pintner        111    IV to VIII              .50     .57
National                Pintner        286    III and IV              .29     .54
[illegible]             Pintner         81    IV A                    .37     .64
[illegible]             Pintner        122    IV to VI                .42
[illegible]             Pintner        192    IV to VI                        .63
[illegible]             Bere           272    10-yr.-old Italians             .66
[illegible]             Bere           260    10-yr.-old Italians             .65
[illegible]             Pintner        122    IV to VI                .36
[illegible]             Pintner        192    IV to VI                        .52
Otis Higher             Pintner         22    Tradesmen               .65
Illinois Intelligence   Goldberger     186    VII to II               .64     .67




















The correlations between IQ’s derived from the tests in question
show higher coefficients than in the case of the correlations between
scores or MA’s. Spurious index correlation is at work here.¹ The
narrower the range of the group, the more the IQ correlation is
boosted. Where the range is fairly wide, as in Grades IV to VIII, there
is relatively little difference between the two coefficients.
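
The size of this spurious element can be sketched with Pearson's approximate formula for the correlation between two indices that share a denominator, here two mental ages each divided by the same chronological age. This is offered as an illustration of the principle and is not necessarily the treatment followed in the article cited in the footnote. The function name `index_correlation` and the coefficients of variation used are hypothetical.

```python
import math


def index_correlation(r12, r13, r23, v1, v2, v3):
    """Approximate correlation between x1/x3 and x2/x3 (Pearson's formula).

    r12, r13, r23 are the correlations among the three variables and
    v1, v2, v3 are their coefficients of variation (SD / mean).
    """
    numerator = r12 * v1 * v2 - r13 * v1 * v3 - r23 * v2 * v3 + v3 ** 2
    denominator = math.sqrt(
        (v1 ** 2 + v3 ** 2 - 2 * r13 * v1 * v3)
        * (v2 ** 2 + v3 ** 2 - 2 * r23 * v2 * v3)
    )
    return numerator / denominator


# Hypothetical case: the two mental ages and the chronological age are all
# uncorrelated, so any correlation between the two IQ's is spurious.
spurious_equal = index_correlation(0, 0, 0, 0.10, 0.10, 0.10)

# Same case, but the mental ages vary less than the chronological age,
# as may happen in a narrowly selected group.
spurious_narrow = index_correlation(0, 0, 0, 0.05, 0.05, 0.10)

print(round(spurious_equal, 2), round(spurious_narrow, 2))
```

In this sketch the spurious coefficient grows as the variation of the numerators shrinks relative to that of the common denominator, which agrees with the observation that the narrower groups show the larger boost.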

Validity—The validity of a test has been defined as the cor- 
respondence between the ability measured by the test and ability 
otherwise objectively defined and measured. The most frequent criterion
for intelligence tests has been some measure of school



1 See article by Thomson, G. H. and Pintner, R.: Spurious Correlation and
Relationship between Tests. Journal of Educational Psychology, Vol. XV, No. 7,
pp. 423-444.

achievement. The simplest and easiest measure of school achievement is the
educational rating of the pupil. Class marks and teachers’ judgments 
are always available, and often the only available criteria. In spite of 
the unreliability of both marks and judgments, they have to be used. 
Naturally the verbal type of intelligence test stands a better chance 
than the non-verbal type of showing a high correlation with such a 
criterion, and it has been frequently shown (see especially Gates, 
op. cit.) that verbal tests correlate much higher than do non-verbal 
with school achievement. If, however, we broaden our definition of 
intelligence to include concrete or mechanical intelligence, we will be 
forced to broaden our criterion against which we check up our tests. 
For ultimately our aim must be not merely to predict success in the 
narrow field of the academic subjects of the curriculum, but rather in 
the whole school situation, and finally in life-situations as well as in 
school-situations. 

So far, the best study of the validity of the Non-language Test 
has been made by Liu.¹ His criterion of intelligence was a composite
made up of chronological age, school marks, teachers’ estimates, 
school progress, and four other intelligence tests. With this elaborate 
criterion the Non-language Test correlates .78 for a group of 235 
children in Grades II to IV, and this shows a very satisfactory degree 
of validity. 

Very different from this is the degree of relationship between the 
Non-language Test and the estimates of teachers as to the intelligence 
of their pupils. In one school, where the children were largely of 
Italian parentage, the following coefficients were obtained between the 
teachers’ estimates of intelligence and the IQ’s of the test. The
correlations are between ranks in each case. 


GRADE              COEFFICIENT    n      GRADE              COEFFICIENT    n
VI B                  +.26       40      V A opposite 1        +.20       28
VI B opposite 1       +.12       25      V A opposite 2        +.17       25
VI A                  +.40       31      IV B                  +.41       33
VI B opposite 2       +.12       10      IV B opposite 1       -.23       33
VI A opposite 1       -.07       16      IV B opposite 2       -.20       14
VI A opposite 2       +.28       26      IV A                  +.19       33
V B                   +.26       32      IV A opposite 1       -.28       16
V B opposite 2        +.38       28      IV A opposite 2       +.34       27
V A                   +.23       37      C                     +.18       11
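
A correlation between ranks of the kind tabulated above can be computed as in the following sketch. It assumes the familiar rank-difference formula, rho = 1 - 6 Σd² / (n(n² - 1)), which holds when there are no tied ranks; the source does not state which rank formula was used. The five pupils, their IQ's, and the helper names `rank_descending` and `rank_correlation` are hypothetical.

```python
def rank_descending(scores):
    """Convert scores to ranks, 1 = highest. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, pupil in enumerate(order, start=1):
        ranks[pupil] = position
    return ranks


def rank_correlation(ranks_a, ranks_b):
    """Rank-difference correlation for two untied rankings of equal length."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Hypothetical class of five pupils.
teacher_ranks = [1, 2, 3, 4, 5]          # teacher's ranking for intelligence
test_iqs = [110, 95, 102, 88, 90]        # IQ's on the test, same pupil order
test_ranks = rank_descending(test_iqs)   # ranks derived from the IQ's

rho = rank_correlation(teacher_ranks, test_ranks)
print(test_ranks, round(rho, 2))
```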





1 Liu, H. C.: Non-verbal Intelligence Tests for Use in China. Teachers
College Contributions to Education, No. 126, Teachers College, Columbia University,
1922.
These coefficients range from -.28 to +.41, with a median about
+.195, from which it is obvious that on the whole there is practically
no relationship between the teacher’s idea of the intelligence of the 
child and the rating obtained by the test. These coefficients are 
much lower than those between the Non-language Test and other 
verbal tests of intelligence. In all probability the teacher is greatly 
influenced by the success of the child in academic work and such suc- 
cess with children of foreign parentage will be partly determined by 
the language ability of the child. In addition to this the ranking of 
children according to intelligence is known to be very unreliable. 
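
The range and median quoted for the eighteen coefficients can be checked directly from the tabulated values. The sketch below simply re-enters the coefficients from the table of teachers' estimates; the variable names are illustrative.

```python
from statistics import median

# The eighteen rank correlations between teachers' estimates and test IQ's,
# copied from the table for the school of largely Italian parentage.
coefficients = [
    0.26, 0.12, 0.40, 0.12, -0.07, 0.28, 0.26, 0.38, 0.23,
    0.20, 0.17, 0.41, -0.23, -0.20, 0.19, -0.28, 0.34, 0.18,
]

lowest = min(coefficients)
highest = max(coefficients)
middle = round(median(coefficients), 3)

print(len(coefficients), lowest, highest, middle)
```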

The three teachers whose rankings gave coefficients of -.23, -.20
and -.28 were asked to rank their children a second time after an
interval of about two months, without knowing the correlations
obtained by their first rankings. They were asked to do this by the
principal and were not aware for what purpose such ranking was to be
used. These second rankings correlated with the test ratings -.01,
+.18 and -.09 respectively, all closer to the test ranking than in the
first case. The reliability of these teachers’ rankings is shown by the
correlation of the first with the second rankings, and these reliabilities
are +.58, -.20 and +.39 respectively, one of them negative and the
other two much lower than the reliability of the test itself.

Somewhat different were the correlations between teacher’s esti- 
mates and ratings on the Non-language Test obtained in 17 classes in 
an institution for the deaf. These are as follows: 


CLASS    COEFFICIENT    n      CLASS    COEFFICIENT    n
  1         +.88        9        9         +.37       12
  2         +.44       11       10         +.26       11
  3         -.06       10       11         +.68       12
  4         +.82       10       13         +.79       12
  5         +.40       12       14         +.46        7
  6         +.76       12       15         +.76       10
  7         +.35       10       16         +.88        8
  8         +.58       12       17         +.48        8
                                18         -.09        6


Here we note very few low or negative correlations. There are 
several very high coefficients. The range is from -.09 to +.88, with 
a median about +.48. We also note that the size of the class in the 
deaf school is considerably smaller than in the school in which the 
children hear, and we may infer that the teacher is more intimately 
acquainted with each of her pupils. If this is true, then we may suppose
that the better the teacher knows the abilities of her pupils,
the more likely are we to obtain satisfactory correlations between her 
estimates and the objective measures of the tests. 

Reliability—The best measure of the reliability of a group test 
is the correlation between two forms of the test given to the same 
pupils with a very short time interval intervening. As there is only 
one form of the Non-language Test, the next best measure of reliability, 
a repetition of the same test after a short interval, has been used. This 
method of obtaining a coefficient of reliability gives, according to
Kelley,¹ the upper limit of the degree of reliability, because of the
possibility of memory transfer or the correlation between errors. 

A group of 201 pupils in Grades IV, V and VI were given the test 
twice, with an interval of two days between the first and second 
attempts, with the following results: 


Correlation between scores, first with second attempt ... r = .79 ± .017
Partial correlation with CA constant ................... r = .733 ± .022
Partial correlation with CA and grade constant ......... r = .726 ± .022
Correlation between IQ’s, first with second attempt .... r = .735 ± .022
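
The figures following the coefficients above are probable errors. A minimal sketch of how such a figure may be obtained is given here, assuming the customary formula PE = .6745 (1 - r²) / √n; the source does not state the formula used, and the function name `probable_error_r` is illustrative. The number of cases, 201, is taken from the text.

```python
import math


def probable_error_r(r: float, n: int) -> float:
    """Probable error of a correlation coefficient r based on n cases."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


n_pupils = 201  # pupils in Grades IV, V and VI who took the test twice

for r in (0.733, 0.726, 0.735):
    print(r, round(probable_error_r(r, n_pupils), 3))
```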


These correlations are very satisfactory considering the narrowness
of the range of talent tested. According to Kelley,² “to secure
a reliability coefficient of 0.40 from a group composed of children 
in a single grade is probably indicative of greater, not less, reliability 
than to secure a reliability coefficient of 0.90 from a group composed of 
children from the second to twelfth grades.” We may assume, 
therefore, that whatever kind of intelligence the test is measuring, it is 
measuring it fairly consistently. 
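
Kelley's comparison can be made concrete on the assumption that a test measures with the same absolute accuracy in a narrow and in a wide group, so that SD² (1 - r) is the same in both. The sketch below rests on that assumption alone; the function name `reliability_in_wider_group` is hypothetical, and the coefficients .40 and .90 are those of the quotation.

```python
import math


def reliability_in_wider_group(r_narrow: float, sd_ratio: float) -> float:
    """Reliability expected in a wider group, given that in a narrow group.

    sd_ratio is (SD of the wide group) / (SD of the narrow group).
    Assumes an equal standard error of measurement in both groups.
    """
    return 1.0 - (1.0 - r_narrow) / (sd_ratio ** 2)


# Ratio of standard deviations at which a single-grade reliability of .40
# corresponds exactly to a reliability of .90 in the wide group.
break_even_ratio = math.sqrt((1.0 - 0.40) / (1.0 - 0.90))

print(round(break_even_ratio, 2))
print(round(reliability_in_wider_group(0.40, break_even_ratio), 2))
```

On this assumption, whenever the spread from the second to the twelfth grade exceeds the break-even ratio, the single-grade coefficient of .40 indicates the more accurate instrument.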

Closely connected with the reliability of a test, and frequently so 
used, is the stability of the intelligence ratings given to children over a 
long period of time. Reamer³ reports the results of the Non-language
Test given to the same children after an interval of almost two years. 
Using the mental indices, as showing the relative ratings of the 
children, she reports a correlation of .73. The number of children 
was 215. In another school for the deaf the test was given in April, 





1 Kelley, T. L.: “Statistical Method.” Macmillan Co., 1923, p. 203. 

2 Kelley, T. L.: The Reliability of Test Scores. Journal of Educational 
Research, Vol. III, No. 5, May, 1921, pp. 370-379. 

3 Reamer, J. C.: Mental and Educational Measurements of the Deaf. Psy- 
chological Review Monographs, No. 132. Princeton, N. J., 1921, pp. 47-48. 
1919 and repeated in May, 1922. There were 81 pupils who repeated
the test after this interval of three years and the correlation of the men- 
tal indices was .75. The mean mental index in 1919 was 61 and in 
1922 it was 60. These two coefficients .73 and .75 indicate a fair 
degree of constancy in the mental ratings given to children on the 
Non-language Test. These coefficients are not as high as most of 
those reported for Stanford-Binet IQ’s, which on the whole, tend to 
fluctuate between .80 and .90. We must remember, however, that 
the Non-language Test is merely a short 30 minute test, whereas the 
Stanford is a long 60 or 70 minute examination. 
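
The bearing of the length of a test upon such coefficients may be sketched with the Spearman-Brown formula, r_k = k r / (1 + (k - 1) r). Strictly, the formula applies to reliability coefficients for comparable added material, so its use with the retest coefficients reported here is an assumption made only for illustration; the function name `lengthened_reliability` is hypothetical.

```python
def lengthened_reliability(r: float, k: float) -> float:
    """Spearman-Brown estimate for a test lengthened k times."""
    return k * r / (1.0 + (k - 1.0) * r)


# A 30-minute test with a coefficient of .75, doubled to about 60 minutes.
doubled = lengthened_reliability(0.75, 2)
print(round(doubled, 3))
```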

Nationality—One obvious advantage of non-verbal tests is that 
they make possible a more direct comparison between English and 
non-English speaking groups. How much a verbal test handicaps a 
child coming from a home in which English is not the sole language 
used, it is difficult to tell. All comparisons between foreign or mixed 
foreign and American children—the typical mixture found in most 
schools in large cities—have shown a much lower MA and IQ on the 
verbal type of test as compared with the non-verbal. Furthermore the 
percentage of the foreign reaching or exceeding the median of the 
American is always greater on the non-verbal than on the verbal test. 
In one school this percentage for the Non-language Test was 50 and 
for the National 37.¹ It is foolish to dismiss evidence of this sort with
the statement that language ability is necessary for school work and 
that, therefore, the verbal test gives the best prediction of academic 
achievement and is all that we require. On the contrary, we want to 
know the intelligence of the child freed from language handicaps so 
that we may know which children will repay our best teaching and 
so salvage some of the intelligence that is going to waste owing to 
our rigid curriculum with its undue emphasis upon verbal types of 
material. 

As to nationality differences on the test itself, the following evi- 
dence is available. The records of all the 12-year old children have 
been taken and distributed according to the family name into Anglo- 
Saxon names, Italian names, German and Dutch names. All names 
about which there might be any doubt were put together into a group 
called “Miscellaneous.” This distribution of the 1361 12-year old
children, tested up to date, gave the following frequencies: 





1 Pintner, R.: Comparison of American and Foreign Children on Intelligence 
Tests. Journal of Educational Psychology, Vol. XIV, No. 5, May, 1923, pp. 292- 
295. 
                          Number   Per cent of total   Mean score    SD
Anglo-Saxon names           653          48.0              329       100
Miscellaneous names         329          24.2              319        94
Italian names               218          16.0              305        91
German, Dutch names         161          11.8              313        91
Total                      1361         100.0              321        97
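
The percentages and the total mean in the table above follow from the group frequencies and group means, as the following sketch shows. The dictionary layout and variable names are illustrative; the numbers are those of the table.

```python
# (number of children, mean score) for each name group, from the table.
groups = {
    "Anglo-Saxon": (653, 329),
    "Miscellaneous": (329, 319),
    "Italian": (218, 305),
    "German, Dutch": (161, 313),
}

total_n = sum(n for n, _ in groups.values())

# Per cent of total for each group, to one decimal place.
per_cent = {name: round(100 * n / total_n, 1) for name, (n, _) in groups.items()}

# Mean of the whole population as the weighted mean of the group means.
total_mean = sum(n * mean for n, mean in groups.values()) / total_n

print(total_n, per_cent, round(total_mean))
```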




















The large percentage of names in the “Miscellaneous” group 
indicates the rigidity with which the grouping was made, only very 
obviously Anglo-Saxon, Italian, etc., names being placed in those 
respective groups. The largest mean score and standard deviation
are obtained by the Anglo-Saxon group. The differences between the
mean scores, the probable errors of these differences and the ratios 
of these differences to their probable errors for the various groups are 
as follows: 














                                 Difference   PE difference   Ratios
Anglo-Saxon - Italian                24           4.92          4.9
Anglo-Saxon - German                  6           5.50          1.1
Anglo-Saxon - Miscellaneous          10           4.38          2.3
Anglo-Saxon - Total                   8           3.17          2.5
Italian - German                     -8           6.37          1.3
Italian - Miscellaneous             -14           5.39          2.6
Italian - Total                     -16           4.51          3.5
German - Miscellaneous               -6           5.93          1.0
German - Total                       -8           5.12          1.6
Miscellaneous - Total                -2           3.91          0.5
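
The first row of the table above can be reproduced from the frequencies, means and standard deviations already given. The sketch assumes the customary formulas, PE of a mean = .6745 SD / √n and PE of a difference = √(PE₁² + PE₂²), which the source does not state explicitly; the function names are illustrative.

```python
import math


def pe_mean(sd: float, n: int) -> float:
    """Probable error of a mean from its standard deviation and n."""
    return 0.6745 * sd / math.sqrt(n)


def pe_difference(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Probable error of the difference between two means."""
    return math.hypot(pe_mean(sd1, n1), pe_mean(sd2, n2))


# Anglo-Saxon group: n = 653, mean 329, SD 100.
# Italian group:     n = 218, mean 305, SD 91.
difference = 329 - 305
pe_diff = pe_difference(100, 653, 91, 218)
ratio = difference / pe_diff

print(difference, round(pe_diff, 2), round(ratio, 1))
```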








The difference between the Anglo-Saxon group and the Italian 
group is in all probability a real difference. The other differences 
are not significant. This adds to the evidence of other workers as to 
the lower mental ability of the Italians in this country. It is of 
interest to note that this is so, even on a test not involving any lan- 
guage ability. The difference between the Americans of presumably 
Anglo-Saxon stock and the Italians is much greater on language tests
but it is still marked on non-language tests. The difference between
children of American parentage and children of Italian parentage
living in the same neighborhood and frequenting the same school
would seem to be zero on non-language tests but quite marked on
verbal tests.¹

The assumption underlying these conclusions is that the popula- 
tion of 1361 12-year old children tested on the Non-language Test 
represents a fair sampling of the population of the country as a 
whole. The difference between the Italian and Anglo-Saxon groups 
found here is not nearly as marked or reliable as that found by Brigham²
between the American group and the Italian group in his analysis
of the Army results. The ratio of the difference to the PE of the 
difference in his case is 96.8, much larger than ours. The army results, 
however, include foreign-born men who took the Alpha or Language 
Test as well as those who took the Beta or Non-language Test. Of 
these foreign-born Italians 16.7 per cent were rated on language tests, 
either the Alpha or the Binet. 

Sex Differences.—Poull³ reports a marked difference between the
scores of boys and girls on the Non-language Test. Her median IQ 
for girls is 92 whereas that for boys is 101.5. Is this true in general or 
merely for the particular group studied? 

An analysis of the total population tested at ages 10 and 12 has 
been made with the following results: 

















                          Age 10
                      Girls    Boys    Total
Mean score             257      254    255.5
SD                      90       98     94
Number of cases        455      469    924
Difference                   3
PE difference                4.2





1 Pintner, R.: Op. cit. 

2 Brigham, C. C.: “A Study of American Intelligence.” Princeton University
Press, 1923.

3 Poull, L. E.: Interests in Relation to Intelligence. Ungraded, Vol. VII, Nos.
7-9, April-June, 1922.








                          Age 12
                      Girls    Boys    Total
Mean score             319      322     321
SD                      95       98      97
Number of cases        737      609    1346
Difference                   3
PE difference                3.6

















A study of the actual distribution shows no real differences. At 
both ages the standard deviations are slightly less for the girls than for 
the boys, but not markedly so. In all probability these two sample 
ages are sufficiently characteristic of the other age groups to conclude 
that there is no sex difference on this test. 

Conclusions.—Non-verbal tests do not correlate very highly with 
verbal tests. They are testing a different aspect of intelligence and 
one that should not be neglected. It is to be hoped that longer and 
better Non-language Tests than the one reported here will be made 
available. So far as the present test is concerned, it has shown a 
fair degree of reliability and validity. We need, however, more 
adequate criteria with which to compare it. As we broaden our con- 
cept of intelligence, we must broaden our criterion of intelligence. 
Obviously school marks in academic subjects and teachers’ estimates 
are in themselves too narrow. The nationality differences indicated 
so far are interesting. The results show the shrinkage in difference 
between nationality groups when the verbal factor is eliminated. 
They also show, however, that between the Italian group and the 
American group a true difference probably remains despite the elimi- 
nation of language. With regard to sex and insofar as this test is 
an index of concrete intelligence, we see no evidence of a greater 
“concrete” ability of boys as compared with girls. The greater
interest and ability of boys in mechanical things as compared with 
girls is probably due to training rather than to inherent differences 
in intelligence. 
THE PRESENT STATUS OF CHARACTER
MEASUREMENT 


PERCIVAL M. SYMONDS 


Teachers College, Columbia University


In 1921 many of the contributors to the symposium on “Intelligence
and Its Measurement” in the Journal of Educational Psychology
stated that one of the next steps in research was the development of
measurement of character. As the late Dr. Colvin put it, “The most
important ‘next step’ for purposes both of prognosis and diagnosis
is the formulation of a test that will inform us of the character qualities
of those tested.” Pintner stated: “I feel that the time is now ripe
for active investigation of the emotions, the character, the will and so
forth, by means of mental test methods.” Pressey said: “There
simply must be a courageous attack upon the problem of measurement
of other than intellectual factors. It is becoming increasingly obvious
that matters of temperament and character are of very great importance,
that they operate quite largely independent of intelligence, that prognosis
problems cannot be adequately understood without an evaluation
of these factors.” Terman states as one of the next steps in research
“investigation of instinctive, emotional and volitional traits and of the
combinations of these which are involved in pre-psychopathic conditions
and normal variations in temperament.” Thurstone said, “I
should like to see another line of mental test work opened up, namely,
the diagnosis of the volitional and emotional characteristics which
determine our character traits.” It is not necessary to continue.
The present paper is an attempt at stock-taking three years later to
determine progress. If so many leaders in psychology believe the
problem is worthy of study and has some possibilities of being fruitful,
without doubt they will have made some contributions themselves
and will have stimulated their students toward research along these
lines.

The review by G. W. Allport (1921) entitled “Personality and
Character” may be taken as a starting point for the present stock-taking.
In this article Allport has given a complete summary of the
work done on the measurement of personality by rating and testing up
to the time that he wrote (article published as of September, 1921).
Rating, “although fraught with perils” he believes to be the “only
available objective criterion of personality.” The testing of traits
other than intelligence traits has taken all sorts of bizarre forms.
The testing of which Allport tells us represents a vague fumbling 
around for suitable approaches. The word-association method, 
various tests of motor impulses, including Downey’s early forms of 
the Will-temperament Test, various tests of ethical interpretation, 
judgment and discrimination, and various questionnaire methods 
practically exhaust these earlier attempts. 

The present summary does not pretend to be exhaustive. As
I have surveyed the literature, eight different methods seem to emerge
as having taken more definite form than the others, as having been
subjected to more extensive scrutiny, and as having proved themselves
“hopeful.” These are:


1. Habit scales 

2. Character scales 

3. Self-assurance or overstatement tests 

4. A specific test of trustworthiness known as the “squares
and circles” test

5. A specific test of trustworthiness known as the paraffin comple- 
tion test 

6. Speed of decision tests 

7. The questionnaire 

8. Ethical judgment tests 


HABIT SCALES


The Upton-Chassell Scale for Measuring the Importance of Good 
Citizenship (1919) seems to be the first scale attempting to give a 
rating scheme for conduct habits. The thing is so obvious that it is 
quite probable that this attempt is antedated by others which were 
not given the same publicity. As it originally appeared in 1919 the 
scale consisted of a list of habits comprising an inventory of the conduct
habits of the “good citizen.” These habits are grouped under
separate captions and the habits under any one caption are arranged
in an order of importance as determined by the ranking of 74 qualified
judges. This scale was made the basis of a part of the report card used
in the Horace Mann School which attempts to make a character diagnosis
of the pupil. An earlier form of the scale had a system of weighting
such that the total possible for a maximum rating on all the habits
was 1000 points. The validity of this scale is apparent: it obviously
measures what it attempts to measure insofar as the ratings are accurate.
The reliability of the scale was not determined. Realizing that the
total scale was a ponderous instrument to manage, the makers have
prepared eight short scales which are selections from the longer scale 
(Chassell, 1922). Each short scale is divided into three parts, each 
part containing habits of different social importance, and the part 
containing the habits of greatest importance, being given the most 
weight. This is a curious unit for scaling. In another place I show
that there is a high correlation (+.7) between “importance” and
“generalness.” In other words this becomes very nearly a scale of
generalness. It is as though an arithmetic scale were constructed
beginning with ability to add two and three and going through the
phases of ability to add in columns, ability to add in general, ability
with the four fundamental processes, etc. The reliability of the scales
has been found high with an average of .895 for 10 different classrooms.
But unfortunately the subjective nature of the rating is such that this
reliability does not hold between groups. However, we have here a
series of scales which distinguish between the members of a group as
to their “habits of good citizenship” with considerable accuracy.

A second habit scale has appeared in the “Tentative Inventory of
Habits” of Rogers (1922). This consists merely of lists of habits for small
children, but they might easily be made into scales similar to the
Upton-Chassell Scales. In fact, now that it has been found that such
a rating scale possesses such high reliability, the possibilities of extension
are numerous. All that is necessary is to inventory a group of
habits which may be observed by any one person. Veverska (1923) 
also gives a list of habits for the four-year old. 

One special scale of this nature by Payne (1921) is entitled “A
Scale for Measuring Personal and Social Behavior, Habits and Practices
in Health and Accident Prevention.” Although this scale is
quite detailed in its construction, it is scored on the all-or-none basis.
The reliability has not been determined.


CHARACTER OR PERSONALITY SCALES 


Instead of rating habits these scales rate traits. The one is a 
dynamic thing, the other a static thing. Whereas habits are elements 
of experience and have a certain degree of reality, we are never sure 
of the reality or existence of traits, nor of their definition. 

The Benjamin Franklin character code was used by Franklin as
the basis for a scheme of self-measurement. More recently Hyde, in
a book entitled “Self-measurement,” set forth a scale for the measurement
of 10 qualities of character. In the present movement the
character scales of Mendelhall (see Character Education Methods:
The Iowa Prize Plan, 1922) may be mentioned. In their present
tentative form they consist of a self-measurement scale for high school
pupils and a self-measurement scale for children, Grades V-VIII.
For each of the 30 traits in the high school scale there are eight gradations,
each gradation described by a brief statement, sometimes
relating to conduct, and sometimes qualitative. The scale devised
for the grades contains 22 traits, each trait with six gradations. Self-measurement
has its pitfalls (Knight and Franzen, 1922, and Hoffman,
1923). The correlations between one self-rating and a later self-rating
range up to +.92. The authors do not state how low the correlations
go or about what value they average. Of course, repetition
of a self-rating would carry with it a reliability spuriously high. 
Correlations ranging from 0 to +.92 are reported between self-ratings 
and ratings by the teacher. Here again we do not know which is 
the more typical figure. 
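
The reliability figures quoted here are product-moment correlations between two sets of ratings of the same persons. A minimal sketch of that computation follows; the function name and the six pairs of ratings are hypothetical and are not data from the studies cited.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical self-ratings of six pupils on an eight-gradation scale,
# taken on two occasions (illustrative values only).
first_rating = [3, 5, 6, 2, 7, 4]
later_rating = [4, 5, 6, 3, 8, 4]
retest_r = pearson_r(first_rating, later_rating)
```

The same function serves for the correlation between self-ratings and teacher ratings; only the second list changes.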

A rating scale for rating personality has been constructed by 
Allport (1921) but as he has not subjected his scale to a critical study 
we may pass it over. 

Porteus (1920) has given us a “Social Rating Scale” with a three-point
gradation, using the recommendations of Scott (1918) to secure
reliability and to reduce the halo of general impression. The traits which
Porteus selected were those which indicated particularly the “social
maladjustment of the mentally inferior or the temperamentally
unstable.” Porteus reports reliability coefficients of +.87 and +.85,
although they do not seem to be strictly reliability coefficients of the
scale under consideration. This is considerably higher than the
reliability reported by Rugg (1921-22), which as nearly as I can estimate
corresponds to coefficients of about +.70 even under the most
favorable conditions. The whole question of reliability of ratings is
in flux. We have Rugg’s statement that a single rating even under
experimental conditions has a probable error of five and six points on a
100-point scale. Slawson (1922) tells us that rating
the personal qualities of teachers by a rank order method gives coefficients
of reliability ranging from +.335 (judicial sense) to +.603 (all-round
value to service). Cady (1923) finds that there is higher
reliability in a judgment when the rater is very sure of his judgment
(r = +.865) than when he is only fairly sure of his judgment (r =
+.479). A period of preliminary observation helps raise the reliability.
Teachers can rate more reliably than the children themselves.
There is much hope, then, for obtaining reliable ratings provided
ingenuity is exercised in securing (1) objectivity and (2) familiarity of 
the raters with the subjects to be rated. In this last connection 
Knight (1923) has shown that familiarity extending over a period of 
time leads to a personal factor in the rating which causes inaccuracy. 

Rating schemes are nothing new. They may be divided into two 
main groups: (1) Rating on a scale of assigned qualities or (2) ranking 
individuals in order on qualities. There is no evidence as to which is 
superior. Although rating schemes are not new there are recent 
developments with new features which seem to increase the value of 
the scales. 

1. Man to man comparison—the Army Rating Scale method. 

2. Indicating a judgment graphically by placing a cross along an 
unbroken line. 

3. Using descriptive phrases for various degrees of a trait to help 
define the differences in the trait more accurately. 

4. If the ratings are on a five degree scale, trying to obtain some 
normality of distribution by assigning the number who shall be rated 
each degree. 

5. Indicating the certainty of the judgment, Cady showing that 
judgments of which the rater expresses certainty are more reliable in 
general than those made with a less degree of certainty. 


SELF-ASSURANCE OR OVER-STATEMENT TESTS 


Leaving the field of rating we turn to tests, the obtaining of a 
record of the response of the subject to a certain situation. First 
among these is the over-statement test. 

Voelker (1921) is the first to propose such a test. The test answers
the question: “Can the subject be trusted to make true statements
in regard to his knowledge?” Simply stated, the test asks the boy to
state his ability or his knowledge about certain things and later tests 
his ability or knowledge. Voelker does not subject this particular 
test to statistical analysis. 

Filter (1921) tried out performance tests of a similar nature. One
was to estimate what “string figures” the subject could reproduce
and then to test his ability in actually doing it. His second test
(called a “decoupage test”) was to estimate his ability to “draw what
the folded and torn paper looks like opened up.” Filter does not
report the reliability of the tests. The string figure test correlates with
the decoupage test +.09. The tests correlate with ratings of over-estimation
+.70 for 6 cases and +.32 and +.28 for 14 cases.

Cady (1923) uses a modification of the test described by Voelker. 
He finds a reliability of +.579. The test correlates with a criterion of 
incorrigibility +.414. 


SQUARES AND CIRCLES TEST


Voelker (1921) calls this the cardboard test and it is No. 9 of his 
second series. It attempts to answer the question “Can the subject
be trusted not to peep when he is placed on his honor to keep his eyes
closed?” The stunt is to close the eyes and try to place a pencil
mark in each of five circles on a card. Since this is practically impossible
to do, it is safe to score the subject zero if he succeeds in placing the
mark in all five circles even once in five trials.

Cady (1923) extends this test to squares and mazes, the basic 
principle being the same. He finds a reliability of +.744. These 
three tests intercorrelate as follows: +.59, +.51, +.63. These three 
tests correlate with the criterion of incorrigibility +.396.


PARAFFIN COMPLETION 


This test is first described by Voelker (1921). “Can the subject
be trusted not to cheat in an examination?” is the question proposed. 
The subject takes a Trabue Completion Test on page 1 of a four page 
folder. Page 3 is paraffined and receives a copy of the original com- 
pletions. Then the subject opens the folder so that the test on 
page 1 may be scored by means of the answers on page 4. Additions 
or corrections in the process of scoring may be determined by com- 
paring with the paraffin copy. 

Cady (1923) also used this test. He found it to have a reliability 
of +.578 but its correlation with the criterion of incorrigibility was 
only +.188. 


SPEED OF DECISION 


Downey (1919) has a simple test of speed of decision in her Will-Temperament
Test. Ruch and Del Manzo (1923) have subjected this
test to critical study. It correlates for a group of 146 high school
students with “speed of movement” +.37, with “freedom from load”
+.33, with “flexibility” +.05, and with a composite of these
+.39.

Filter (1921) gives six tests of speed of decision, the first of which is 
similar to the Downey Test. The intercorrelations vary from +.31
to +.66 and average for the six tests as follows: +.51, +.44, +.47,
+.52, +.438, +.46. It is evident that here we have a kind of test 
which shows itself worthy of further investigation. 


THE QUESTIONNAIRE 


The questionnaire is perhaps the oldest and most direct of methods, 
yet it has been suspected by psychologists because of its subjectiveness. 
Thorndike sums up the value of the questionnaire as follows: “Conclusions
about the facts studied only indirectly through the reports of
incompetent observers, in the case of individuals representing a partial
and undefined selection, compiled by a single and possibly prejudiced
student, without the knowledge of the technique and logic of statistics
are unreliable.” Cady seems to be the first one to care to subject the
questionnaire to a critical statistical analysis. Using an adapted
Woodworth psychoneurotic questionnaire (see Hollingworth, “Psychology
of the Functional Neuroses,” or Franz, “Handbook of Mental
Examination Methods”), Cady obtains reliabilities of +.55 and +.47.
With such reliabilities, the questionnaire takes its place with the other 
tests which we have been discussing. The correlation of the question- 
naire with the criterion of incorrigibility is +.36. 

Test 4 of Pressey’s (1921) X-O tests is in reality a questionnaire. 
It asks the subject to “cross out everything (in a list of words) about 
which you have worried or felt nervous, or which you have ever 
dreaded.” A revision of this test (Form B) contains another questionnaire
asking the subject to “cross out (in a list of words) everything
you like or are interested in.” To date there has been published no critical
study of the tests, although Pressey promises “that such examinations
will be more accurate than the Army Scale Alpha in prognosticating
unsatisfactory work in college.”

Undoubtedly the questionnaire is capable of becoming a useful 
instrument in combination with other tests for measuring certain narrow 
aspects of character.


TESTS OF JUDGMENT OF MORAL TRAITS 


This form of test is not new, but there have been some recent
developments and study of the method. Binet incorporated “What
should you do?” questions in his scale which are samples of moral
judgment tests. Test 3 of Army Alpha contained moral judgment
questions. Colvin (1922) described a moral judgment test constructed
by Liao, following the plan of Test 3 of Army Alpha. A sample
question is:


I. It is wrong not to work. 
1. Idle people are called lazy. 
2. Idle people earn no money. 
3. Idle people are discontented. 
4. Idle people live on the works of others. 
5. Good men tell us we should work. 


No mention is made of any critical study of these tests. Quite obvi- 
ously reading ability plays a large part in the score. 

Brogan (1923) had over 500 persons in the University of Texas
rank 16 practices in order of “badness.” No use was made in the
experiment of individual results. The author was interested rather
in constructing a “badness” scale.

Certain of the tests in Pressey’s earlier cross-out tests and in his 
later X-O tests are moral judgment tests. As a sample take the
following: “Read through the 25 lists below and cross out everything
that you think is wrong, that a person is to be blamed for.” The
Pressey Tests have not been studied. 

Cady uses a word judgment test. He finds a reliability of +.86 
for intelligent adults, but only +.38 with children used in this study. 
This test correlates +.30 with the criterion of incorrigibility and +.34 
with intelligence. 


CRITICISM 


I should like to offer three points of criticism of the movement to 
date to measure character which I hope will be constructive, or at 
least will stimulate discussion. 

In the first place, are we trying to measure something that actually 
exists? When I read over a list of traits such as intelligence, neatness,
humor, beauty, refinement, sociability, likeableness, snobbishness,
conceit, vulgarity, I am wondering if there is any one thing that corresponds
to these names. It smacks very much of “faculties.” “Whatever
we may name exists,” is a tacit assumption that we too easily
make. If we have found that memory and imagination are particular, 
depending on the material with which they deal, so much the more 
snobbishness, vulgarity, or even honesty and trustworthiness. A 
man may be snobbish at his club, but friendly with his servants at 
home; a man may be vulgar with men and genteel with women; even 
trustworthiness depends on the situation. To some of these things,
such as trustworthiness, there are undoubtedly original tendencies,
otherwise we would be at a loss to explain “light fingeredness.”
Where there may be strong individual predilections, we may suppose 
an original tendency. But so much of trustworthiness depends on 
the situation, on previous training, on the degree of social approval or 
disapproval. So before we try to measure trustworthiness, incorri- 
gibility, self-assurance, or what not, perhaps it would be worth while 
to question the existence of these things as universal individual quali- 
ties that work quite regardless of the special situation. Low test 
intercorrelations may be due to just this—that slight change of the 
situation will lead to a totally different reaction. This is no cause for 
discouragement. It should make us more analytical and more prone 
to admit the independence of abilities even though they be more or less 
correlated. This criticism really strikes at the root of our concept of 
character. What is character? Is it an entity, having a static positive
existence of its own which issues into conduct, or is it merely the sum 
total of our conduct tendencies? Of course the latter fits in better with 
modern psychological concepts. If we believe character to be largely 
native then we are justified in attempting to measure general qualities; 
if we believe that character is largely acquired we ought to examine
critically “qualities” or “characteristics” and proceed more analytically.
This latter procedure demands more attention.

Secondly, perhaps it is human nature to do something only for a 
direct purpose. I believe that our progress in character testing would 
be more rapid if we worked with an indirect purpose. Would not 
progress be more rapid if instead of trying to construct “valid” tests 
we try to construct “reliable” tests? Every test maker is apparently
trying to measure something (the emotions, trustworthiness,
incorrigibility, speed of decision); that is, he is trying to construct
valid tests. Would not progress be more rapid if we attempted to 
construct reliable tests, regardless of the specific thing they measure? 
It goes against all of our research ideas to work with this motive. To 
the question “What are you trying to measure?” the answer “I don’t
know what I am trying to measure but I am trying to measure it
accurately” would verge on the ridiculous. Yet this is precisely the
process and point of view which is going to mean the fastest progress.
Our greatest need is for reliable tests and for criteria of reliability.
Effort is wasted in building valid tests (tests that measure something)
which are so unreliable as not to give the same answers on a second trial.
But once having constructed a reliable test, it is comparatively easy
to find out what it measures and all is gain. Test makers put out tests
purporting to measure certain general and important “traits of charac- 
ter.” But if the tests possess zero reliability there is no gain—the 
gain seems to be only in proportion to the reliability achieved. It is 
perhaps too much to get people to set to work blindly trying to con- 
struct reliable tests without at the same time trying to measure some- 
thing in particular. But at the present time reliability is more 
important than validity. 
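
The claim that the gain is in proportion to the reliability achieved can be put in symbols. The relation below is the classical attenuation formula; it is not stated in the text and is offered only as an illustrative sketch. The symbols are introduced here for the purpose: $r_{xy}$ is the observed correlation of a test with a criterion, $r_{xx}$ and $r_{yy}$ are the reliabilities of test and criterion, and $\rho_{T_x T_y}$ is the correlation between what the two truly measure.

```latex
r_{xy} = \rho_{T_x T_y}\,\sqrt{r_{xx}\, r_{yy}}
\qquad\Longrightarrow\qquad
|r_{xy}| \le \sqrt{r_{xx}\, r_{yy}}
```

Under this relation a test of zero reliability can show no correlation with any criterion, and a test with a reliability of .38, even against a perfectly reliable criterion, could not correlate with it above about .62.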

It is also contrary to human nature to be content with measuring 
anything less than the most important traits. But as I maintained 
earlier in this paper the most important traits are also the most general 
traits. Here again perhaps the line of most progress is in the attempts 
to measure very specific traits or habits. Of course every test does 
this—it measures a very specific response to a very specific situation. 
But the test maker blindly interprets this as a general reaction. 
There seem to be two fundamentally different kinds of general traits. 
Thrift is an example of one of these. Thrift seems to be a bundle of 
more or less loosely connected special habits—habits with regard to 
and conservation of materials, earning, saving, spending and repairing. 
So an index of a person’s “thriftiness” would be an inventory of his
responses to all these various situations. A second kind of general 
trait is well exemplified in neatness. It is the individual’s response to 
a single element in a number of different situations. I have elsewhere 
called such a trait a confact (cf. concept) to use a word which may 
acquire a connotation in harmony with behavioristic notions. A con- 
fact is a conduct response (as opposed to a mental or verbal response) 
to a common element of various situations. It is these confacts that 
workers have been interested in, in their attempts to measure character.
But the confact must be tested in more than one situation.
Otherwise we have tested the confact no better than we have tested
the concept green when a color blind child picks out a green worsted 
from a pile of four colors by familiarity with the knot in which it is tied. 

We need to come down out of the sky and think less in terms of 
kinds of personality, or of traits of character and more in terms of 
habits of conduct, or specific reactions to well defined situations. 


POSSIBLE LINES OF DEVELOPMENT


I suggest what seem to me to be profitable lines of attack: 
1. Measurement of Health—Why not a health scale? At the 
present time there are several health indices: various nutrition indices,
Sargent’s index, Schneider’s index, the Foster Test, the Crampton
Test, strength tests, athletic proficiency tests. Could not these
be combined by multiple correlation into a health rating that would 
give a better correlation with a criterion of health than any one 
test alone? Only the barest beginning has been made at finding the 
intercorrelations of these important health indices (Finklestein and 
Williams, 1922). It would also be valuable to know the worth of 
health rating scales (Are they more reliable than ratings of less objective
traits?) and of health questionnaires of children. What relation is 
there between health habits and health, or is the absence of any habit 
sufficient to make a distinct influence on health? 

2. We need very badly a scale of generalness of conduct habits, 
determined as accurately and scientifically as possible, that may serve 
as a basis for ratings. What is the correlation between positions
on such a scale and reliability of ratings? Such a scale might turn
out to be the key for scale rating reliability. (Of course other factors 
such as acquaintance enter also.) 

3. Measurement of Ability to Study.—An inventory of study habits 
would be a beginning. What is the correlation of intelligence and 
study habits? And what is the correlation of achievement and study 
habits? What are the partials? 

4. Measurement of Manners.—One method would be to inventory 
the manners habits. What is the value of the questionnaire in this 
connection? How closely does knowledge of manners correlate with 
the actual habits? One lead is the use of “What’s wrong in this
picture?” as a test of knowledge. 

5. What is the correlation between moral judgments and ratings on 
the Upton-Chassell scale? 

6. Can the questionnaire be used as an attitudes test? 
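
Item 1 above asks whether several indices could be combined by multiple correlation to predict a criterion better than any one index alone. A minimal sketch for the simplest case, two indices and one criterion, follows. The function name and the three correlation values are hypothetical, chosen only for illustration, and are not results from any study cited here.

```python
from math import sqrt

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion y with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_squared)

# Hypothetical values: two health indices correlating .50 and .40 with a
# health criterion and .30 with each other.
combined = multiple_r(0.50, 0.40, 0.30)
```

With these assumed values the combination correlates about .56 with the criterion, somewhat above the better single index at .50. The lower the intercorrelation of the indices, the larger this gain, which is why the intercorrelations need to be known first.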

My estimate is that with the development of tests designed to 
measure special features of conduct which prove themselves reliable, 
we will have at the same time tests which will also correlate with more 
general criteria and may be used in batteries for the measurement of
“character,” much as intelligence tests are batteries of more specific
tests.

Since the above was written several developments have taken
place in the measurement of character which show that real progress 
is being made and that lines of attack are being opened which show 
considerable hopefulness. 

First of all, Cleeton and Knight (1924) have demonstrated beyond 
any possible cavil that physical characteristics, particularly of the 
physiognomy, have no predictive value for character. This has been 
part of the scientific psychologist’s mental equipment for some time 
(see Dunlap, 1922), but the value of such external signs has always 
been the shibboleth of psychological charlatans. 

Chassell (1924) has developed a variation of the moral judgments 
test called a test of ability to weigh foreseen consequences. After a 
story is read to the group to be tested the members of the class indi- 
cate by marking plus or minus which of a number of possible con- 
sequences of the incident in the story seem desirable and which 
undesirable. On the basis of these decisions the children are finally 
to judge as to the rightness or wrongness of the moral problem involved. 
The test has a reliability of .87. This test has yet to show that it is a 
practical instrument. 

The most hopeful development comes from Iowa. Hart (1923)
developed a variation of the questionnaire which seemed to have
considerable capacity for differentiating individuals on the basis of
various social attitudes such as “general altruism,” “inter-racial-mindedness,”
“international-mindedness” and the like. As a sample
of the method, several activities are mentioned after which are printed 
both plus and minus signs. The student indicates his like or dislike 
of an activity by encircling a plus or minus sign. The five things 
which he feels strongly about he underlines and the one thing which 
he feels most strongly about he double underlines. Shuttleworth 
(1924) experimented with the method and found that such an instru- 
ment measured money-mindedness with a validity of +.95, the cri- 
terion being students’ ratings of one another; and a reliability of .901. 
Best of all, the test seems to be cheat-proof to a high degree, a quality 
that is difficult enough to obtain even for an achievement test. 

These results have been so promising that Shuttleworth is now 
working on a similar test to measure the character traits other than 
intelligence involved in scholastic success. Such a test has been 
given this fall to all freshmen entering the University of Iowa.


THE GEOGRAPHY OF INTELLIGENCE
RAYMOND FRANZEN 


University of California 


The Problem.—Measurement should precede definition. Some 
of the critics of classroom use of psychological tests assume that the 
converse is true. They demand that we shall fully define intelligence 
before we place any faith in the tests. This is comparable to discard- 
ing our watches until time is defined and understood, instead of use- 
fully defining time in terms of watches. Our only hope for clarity in 
the interpretation of the qualities of citizenship is by way of enough 
measured evidence. The variables which constitute educational 
values will ultimately be defined in terms of the measurements which 
prove to be objective and reliable. 

A group of sincere critics of education policy has concentrated 
upon the degree to which results upon the so-called “intelligence”
tests are determined by prenatal causes. Their argument often is a 
naive insistence upon the acquired nature of the medium of measure- 
ment. They agree that children learn to read and figure and know the 
opposites of words. They infer from this that these media must test 
nurture and not nature. All measurement, including that achieved by 
psychological tests, is indirect. We measure heat by means of obser- 
vations on the expansion of mercury and we measure the common 
factor in the learning process by means of the degree to which children
have learned and do learn ordinary activities. An ideal criterion for
what we usually mean by “intelligence” would be the disparity of
achievement of 100 children in 100 activities when the conditions of 
training were as near perfect as possible. Given a surfeit of intellectual 
opportunity what differences are there in the speed and facility 
of learning? 

Most of the evidence which makes more tenable the point of view 
that nature and not nurture is in the last analysis responsible for 
individual differences in facility of learning, has been formulated 
before our present group tests were perfected and therefore is not 
easily translated into our present need. Galton and Cattell have shown
that eminent parents have eminent children. Goddard and others have
proved the transmission of feeblemindedness. Spearman and others
have upheld the position that there is a large common factor in achievements
of various kinds. Hollingworth has indicated increased variability
of achievement with increased training. Franzen has shown
that the correlations between the Binet Test and school achievements
increase with training. Terman and his pupils have contributed
evidence of correlations between IQ’s of parents and children. These 
and other investigations determine a genuine inherited common factor 
of learning. 

It is desirable to trace the existence of this common factor in the 
group tests we now use. Evidence thus far presented does not justify 
an inference as to the relative share of heredity and environment. 
Thus the results on the Alpha Test by geographical areas have been 
correlated with the educational status of those areas, but a high corre- 
lation would be equally probable, whether Alpha measured inherited 
or acquired traits since low educational status may mean either or 
both. A conclusive experiment of the relative rôle in group tests of
heredity and training is perhaps reserved for the future. It is possible, 
however, to show how nearly constant these measurements are so as 
to force any exponent of the nurture theory to shoulder his full quota 
of responsibility. 

If the 10-year-olds in a particular school are much lower than the
10-year-olds in another school in the same city, but the 13-year-olds 
in this first school are better in that same test than the 13-year-olds 
in the second, then we could say that whatever is measured has no 
definite relation to the location of the school. But if a school’s rating 
among other schools of the city remains the same, no matter what age 
group or what test is used, then we must admit that the geographical 
character of the school determines its intellectual placement. In that 
case, the protagonist of nurture must be willing to condemn the present 
organization of society at the same time that he claims that the differ- 
ences in test scores are not the result of innate differences, since then 
average scores of children in kindergarten, average ability throughout 
the elementary school, average success in high school and entrance to 
college are all correlated with the sociological status of a school. 

The question which this paper raises is: Are we in an unfavorable 
environment because we get low scores on an intelligence test, or do 
we get low scores because we are in an unfavorable environment? 
We hope to force those who would claim that we get low scores because 
of acquired rather than inherited causes to see a dilemma resulting
from consistency of ratings. If once low means always low, then they 
must call democracy as expressed by our present-day school system 
a failure or they must relinquish their position.

The Tables.—In all the following tables the variables are test records
made in different schools and the record of any one test for any one 
group of any one school is the average on that test of the children in 
that group in that school. Thus, if 10-year-old Wylie is one variable 
and Kindergarten Group B Park-Franzen is the other, then each school 
has a record on the Wylie which is the average score of the 10-year-olds 
in that school and each school has a record on the Park-Franzen which 
is the average score of the Kindergarten Group B children in that 
school. The correlation of these two variables measures the associa- 
tion among these schools between having high or low Wylie scores for 
their 10-year-olds and high or low Park-Franzen scores for the low 
kindergartens. No inference is here made regarding the variability 
of the children within a school. We are concerned with correlations 
of various traits which are characteristic of schools as a whole, not of 
individuals. 
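
The unit of analysis described above, the school average rather than the individual child, can be sketched as follows. The school labels, the scores, and the function names are all hypothetical and stand in for the real records, which are not reproduced in this paper.

```python
from math import sqrt
from statistics import mean

def school_means(scores_by_school):
    """Average score of one group on one test, per school."""
    return {school: mean(scores) for school, scores in scores_by_school.items()}

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def school_level_r(test_a, test_b):
    """Correlate school averages on two tests; each school is one case."""
    means_a = school_means(test_a)
    means_b = school_means(test_b)
    schools = sorted(set(means_a) & set(means_b))  # schools with both records
    return pearson_r([means_a[s] for s in schools],
                     [means_b[s] for s in schools])

# Hypothetical scores: one test given to 10-year-olds, another to a
# kindergarten group, in four schools. Different children take each test.
ten_year_olds = {"A": [40, 44, 48], "B": [30, 34], "C": [20, 22, 24], "D": [35, 37]}
kindergarten = {"A": [12, 14], "B": [9, 11, 10], "C": [6, 8], "D": [10, 12]}
r_schools = school_level_r(ten_year_olds, kindergarten)
```

Because the two groups contain different children, a high value here says nothing about individuals; it says only that schools standing high with one group tend to stand high with the other.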

The dates of the tests are different, so that records of schools on 
the Wylie at one time are correlated with records of schools on the 
Woody-McCall at another time, etc. The dates of any one test are 
of course the same for all schools. 

The Wylie results are for Forms A, B and C. The Woody-McCall
results are for Form II. The Park-Franzen results are for Form I. 
The Spelling results are number of words spelled correctly out of 75 
words taken from the column of the Buckingham-Ayres Spelling 
Scale which allows 50 per cent incorrect for Grade V at mid-year. The 
Handwriting results are the average of six teachers’ judgments, using 
the Thorndike Handwriting Scale. 

Table I shows the consistency of the differences between schools for 
various groups in each of the tests used. Some reliabilities are included
where it is necessary. We measure consistency by the correlation of 
results for different age or grade groups. We measure reliability by 
the correlation of results on two different measurements of the same 
age or grade group. When consistency is as high as desired, it is not 
necessary to measure reliability, since then reliability must be high. 
Consistency is one of our major contentions in this paper. 

¹ The standard deviations of each age and grade group in each school are on
record but make too bulky a table for inclusion in this article.

The other contention is supported by Table II and is the significant
correlation between placements of schools in various tests given to
various groups with placements of these schools on the Wylie given to
age groups.¹ Some of the divisions of Table II show correlations
between variables which use records obtained from Grade V data alone; 
some use Grade VII data alone, and others use a score which combines 
those of various groups. These last are called totals. They are the 
sum of averages unweighted. Thus, the Wylie Total for each school 
is the sum of the Wylie averages for each age, 8 through 14, as obtained 
in that school. The Arithmetic Total is a similar combination of the 
records for Grades IVa, Va and VIa. The Spelling Total is a combina- 
tion of the records for Grades Vb, Va, VIIb, and VIIa. The Hand- 
writing Total is a combination of the records for Grades IV, V and VI. 
The Park-Franzen Total is a combination of the records made by 
Kindergarten B and Kindergarten A. 

Table III gives rank placements of eleven schools in some important 
variables. (N is limited to 11 schools because all measurements were 
available for these only.) 

Tables IVA, IVB and IVC give intercorrelations of the variables 
with which the Wylie was correlated in Table II. Table IVD gives the 
correlations of Arithmetic and Spelling with Wylie when Handwriting 
is constant. Table IVE gives the correlations of Arithmetic and of 
Spelling with Handwriting when Wylie is constant. 

Table V includes the standard deviations of the variables used. 
They are given in order that the reader may understand the effect 
that the standard deviations have upon the size of r in the various 
groupings. The fifth and seventh grade r’s are not comparable to the 
others since they are calculated from single M’s of tests and not 
combinations of M’s. 

Conclusions.—1. The portion of a city in which a school stands is 
a reliable determination of its intellectual status as measured by the 
average of age and grade groups (Tables IA and IB). The rating 
a school enjoys relative to the scores of any age or grade group is like 
that yielded relative to the measurement of any other age or grade 
group. This consistency becomes somewhat less the further the age 
or grade groups are apart, but is quite high throughout. Then adjustments 
of texts, curricula and methods may be made to the status of a 
school without fear of immediate change. Differences between schools 
are a function of some sociological factor associated with their 
geography. 

1 It would be desirable to precipitate statistically the common factor in the 
distinctions between these schools and correlate it against average income of 
parents, value of the property on which the school stands and other economic 
criteria. The writer is familiar with the school system treated in this paper and 
would predict that the correlation between average income of parents and the 
common factor behind the relative excellence in achievement of each school is 
high. Thus, Hubbell and Elmwood parents are well-to-do and McKinley and 
Washington parents are not. 

The differences in size of correlation, however, decrease regularly 
with the size of the time interval between the groups correlated. This 
suggests that sociological and economic changes had begun to interfere 
with the status of some of these schools at the time of the testing. 
Certain schools have a somewhat different status relative to far 
removed time groups. Changes in clientele are in process. They 
affect some age groups more than others. Some districts, for instance, 
may be becoming less desirable residence districts. Then relative to 
children in other parts of the city, the younger children are less 
intelligent than are the older ones. In other districts the more stable portion 
of the community may have less economic success than the part which 
moves in early and moves out before the children reach the higher age 
groups. 


When deviations in multiples of sigma are obtained for individual schools, such 
interpretations as these may be made for those schools which show less 
than the general consistency between groups. For instance: 

         Park-Franzen    10-year-old   11-year-old   12-year-old   13-year-old   14-year-old
         Kindergarten B  distribution  distribution  distribution  distribution  distribution
         distribution
Sabin        .43            .31           .25           .05          -1.16          -.72


Sabin is in a district where apartment house conditions and rents are 
such that you would expect a greater proportion of intelligent parents 
to belong to the younger children. These intelligent parents would 
move to another district as their economic status improved, leaving 
the children of the less gifted parents in the upper grades. 
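The figures for Sabin are of the kind obtained by expressing a school's average as a deviation from the mean of all school averages, in multiples of the standard deviation of those averages. A minimal sketch follows; the school names and averages are hypothetical, the function name `sigma_deviations` is illustrative, and the use of the population form of the standard deviation is an assumption.

```python
from math import sqrt

def sigma_deviations(school_averages):
    """Express each school's average as a deviation from the mean of all
    schools, in multiples of the standard deviation of the school averages."""
    values = list(school_averages.values())
    n = len(values)
    mean = sum(values) / n
    sigma = sqrt(sum((v - mean) ** 2 for v in values) / n)  # population form (assumed)
    return {name: (v - mean) / sigma for name, v in school_averages.items()}

# Hypothetical averages for one age group in four schools.
averages = {"School A": 52.0, "School B": 48.0, "School C": 44.0, "School D": 40.0}
z = sigma_deviations(averages)
```

Repeating the calculation for each age group gives one row of deviations per school, which can then be read across the groups as was done for Sabin.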

2. Variability in intelligence is not consistent throughout age 
groups (Table IC). We cannot reliably predict the variability in the 
Wylie of one age or grade group in a school when we know the vari- 
ability of another age or grade group as we can predict the average 
score of one group from another. Then the need for classification is 
not a constant feature of a school, but must be determined for groups. 
Variability is not associated with geographical location. 

3. As the Park-Franzen distinguishes schools when given to the 
Kindergarten A classes, so does it distinguish schools when given to 
the Kindergarten B classes. Whatever it measures is correlated with 
the geographical location of schools (Table ID). This is also true of 
the Woody-McCall in Grades IV, V, VI and VII (Table IE). This 
is not as much true of spelling as measured in Grades V and VII nor 
of handwriting as measured in Grades IV, V and VI (Tables IF and 
IH). Still these latter two tables represent reliable measurements of 
schools and of individuals (Tables IG and IJ). The spelling and hand- 
writing as measured vary more independently of the geographical 
location than do our other measurements. Since handwriting and 
spelling as measured are more readily susceptible to training, irre- 
spective of intelligence, than Park-Franzen or the Woody-McCall, this 
may indicate what the sociological variable is. It is probably some 
economic variable associated with intelligence, in which case this 
difference is one we would expect. 

4. Distinctions made in schools by the Park-Franzen given in the 
kindergarten, the Wylie given in any age-group, the Woody-McCall 
given in Grades IV, V and VI and the spelling given to Grades V and 
VII have a common factor. Whatever influence makes a school 
good or bad in these tests is a consistent influence. Correlations of all 
variables except handwriting are high with the Wylie. The environment 
has not exerted the mysterious “saving” influence claimed for 
it (Tables II and III). Handwriting, however, varies independently 
of location of school. Education has overcome original differences in 
this ability. No matter in what portion of a city a school is located, 
it may aspire to rank first in respect to the quality of its handwriting, 
but not in respect to its arithmetic or even spelling, because these 
are definitely correlated with success in Wylie Opposites. The rank 
of a school in these abilities in the upper age and grade groups may 
be measured fairly accurately by the rank of its kindergarten (Table 
III). Is it not reasonable to expect those distinctions to hold in the 
other direction and to be inherent in the individuals at birth? 

It is to be noted here that though intercorrelations of groups are 
lower in spelling than they are in arithmetic (Table I), still spelling 
correlates only a little lower with the Wylie Test than does arithmetic. 
Consistency is a little lower, but correlation with the common factor 
is still high. This means that the changes in the position of schools 
which are achieved by training are unimportant relative to the large 
differences due to some unchanging condition associated with geograph- 
ical location. Deviations expressed in multiples of sigma are included 
in Table IJ so that the reader may, if he desires, trace the lack of 
consistency in spelling to the individual schools responsible. 
5. It is possible to view differences of schools in handwriting which 
are irrespective of intelligence, as differences in efficiency of training. 
Then differences of schools in arithmetic or spelling which are cor- 
related with differences in handwriting irrespective of intelligence, 
might be interpreted to be differences due to training. Of course, it 
might also be true that differences of schools in handwriting with 
intelligence constant were unrelated to differences in arithmetic or 
spelling with intelligence constant. In the latter case there would 
be no correlation of these subjects with handwriting when intelli- 
gence was constant. Also the correlations of intelligence with these 
subjects would be as high with handwriting constant as they were 
originally. 

The partials were therefore calculated (Tables IVD and IVE) 
in an effort to trace the influence of intelligence and training (or 
environment). The correlations of arithmetic and spelling with intelli- 
gence when handwriting is constant are practically the same as those 
in Table II. This means that when deviations for each school in 
achievement and in intelligence are taken from the average achieve- 
ment and average intelligence of their own class of handwriting, the 
correlation remains the same as when the deviations are taken from 
the average achievement and intelligence of the whole group, whatever 
their handwriting may be. Then the factors, whatever they may be, 
which cause differences in handwriting, and are over and above those 
factors which cause differences in intelligence (Table IID), are 
unassociated with differences in arithmetic and spelling. Tables 
IVA, IVB and IVC are the background for this conclusion. 

The correlations between handwriting and achievement when 
intelligence is constant, give another aspect of this same conclusion 
(Table IVE). Though there is some indication here of a slight 
association, it is not marked enough to justify emphasis. It would be 
interesting to get a better estimate of training than that obtainable 
from our handwriting records and then to get the correlation of this 
with achievement when intelligence is constant. In the 12 schools 
using totals, the correlation of arithmetic with spelling when intelli- 
gence is constant is .31. This would suggest that there is some factor 
of training which distinguishes these schools in some common way in 
both arithmetic and spelling, but that the factor is smaller in influence 
than the common factor of excellence in achievements which is asso- 
ciated with their geographical location. 
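The partials discussed above can be checked against the zero-order correlations between totals printed in Table IVC. The sketch below assumes the standard first-order partial correlation formula; the source does not state the computing procedure used, and the function and variable names are illustrative.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Zero-order correlations between totals, as printed in Table IVC (12 schools).
r_wylie_arith = 0.76
r_wylie_spell = 0.67
r_arith_spell = 0.66
r_wylie_hand = 0.18
r_arith_hand = 0.33
r_spell_hand = 0.26

# Handwriting constant (compare Table IVD, total data).
arith_wylie = partial_r(r_wylie_arith, r_arith_hand, r_wylie_hand)
spell_wylie = partial_r(r_wylie_spell, r_spell_hand, r_wylie_hand)

# Wylie constant (compare Table IVE, total data, and the .31 quoted in the text).
arith_hand = partial_r(r_arith_hand, r_wylie_arith, r_wylie_hand)
spell_hand = partial_r(r_spell_hand, r_wylie_spell, r_wylie_hand)
arith_spell = partial_r(r_arith_spell, r_wylie_arith, r_wylie_spell)
```

With two-decimal inputs the results come to about .75, .66, .30, .19 and .31, which agree with the printed values to within about .01 (the printed figure for spelling with handwriting is .18).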
We are willing to entertain the hypothesis that the differences in 
achievement among schools (differences which are consistent, which 
are predicted by differences in the Park-Franzen scores and which are 
correlated with differences in Wylie Opposites) are differences which 
may be removed by better methods of teaching, though we are skeptical 
of this hypothesis. We are not willing to entertain the hypothesis 
that differences associated with economic or social status disappear 
in our present educational careers. That is contrary to fact. 


TABLE IA. INTERCORRELATIONS BETWEEN WYLIE AVERAGES FOR VARIOUS AGE-GROUPS
(27 schools)

             10 years   11 years   12 years   13 years   14 years
11 years       .86
12 years       .77        .92
TABLE IB. INTERCORRELATIONS BETWEEN WYLIE AVERAGES FOR VARIOUS GRADE-GROUPS
(23 schools)

               Grade    Grade    Grade    Grade    Grade
               IVb      Vb       VIb      VIIb     VIIIb
Grade Vb        .57
Grade VIb       .71      .73
Grade VIIb      .54      .58      .72
Grade VIIIb     .50      .58      .69      .71
Mean          24.17    36.35    49.30    60.96    72.04
Sigma¹        10.59    11.59    11.43     8.73     9.48




















1 The sigmas here are of course computed standard deviations of means. 
TABLE IC. INTERCORRELATIONS BETWEEN WYLIE SIGMAS FOR VARIOUS GRADE-GROUPS
(23 schools)

               Grade IV   Grade V   Grade VI
Grade V          -.08
Grade VI         -.46      +.08
Mean            17.31     19.85     20.65
Sigma¹           5.68      4.02      3.66














1 The sigmas here are of course computed standard deviations of standard 
the data in Table I for each of these grades. 


TABLE ID. CORRELATION OF PARK-FRANZEN CLASS A AVERAGES WITH CLASS B AVERAGES

(21 schools)  .85

σ of Class B = 6.64 
σ of Class A = 5.83 


TABLE IE. INTERCORRELATIONS OF WOODY-McCALL ARITHMETIC AVERAGES FOR 
VARIOUS GRADE-GROUPS 
TABLE IF. CORRELATION OF BUCKINGHAM-AYRES SPELLING AVERAGES FOR 
GRADES V AND VII 

(17 schools)  .59 

σ of Grade V = 11.41 
σ of Grade VII = 12.11 


TABLE IF. INTERCORRELATIONS BETWEEN SPELLING AVERAGES FOR VARIOUS GRADE-GROUPS
(17 schools)

                       Grade    Grade    Grade    Grade
                       Vb       Va       VIIb     VIIa
Grade Va                .84
Grade VIIb              .39      .32
Grade VIIa              .62      .59      .63
[row label illegible]  4.77  [illegible]  6.19     6.74

















TABLE IG. CORRELATION OF BUCKINGHAM-AYRES SPELLING AVERAGES IN GRADE Va, 
25 WORDS WITH ANOTHER 25 WORDS 

(17 schools)  .90 

σ of 1st 25 words = 15.24 
σ of 2nd 25 words = 14.95 


TABLE IH. INTERCORRELATION OF THORNDIKE HANDWRITING AVERAGES (FOR 
QUALITY) FOR GRADES IV, V AND VI 
(27 schools)

                       Grade IV   Grade V   Grade VI
[row label illegible]    .60
[row label illegible]    .52       .52
[row label illegible]    .55       .55       .57














TABLE IJ. AVERAGE OF 14 CORRELATIONS BETWEEN INDEPENDENT MEASUREMENTS 
OF 48 HANDWRITING SAMPLES (THESE MEASUREMENTS BEING 
MADE WITH SAME TECHNIQUE AS THORNDIKE HANDWRITING 
SCALE USES) 

.82 
TABLE IJ. DEVIATIONS IN MULTIPLES OF SIGMA FOR 17 SCHOOLS IN SPELLING, GRADES Vb, Va, VIIb AND VIIa

School          Grade     Grade     Grade     Grade
                Vb        Va        VIIb      VIIa
[illegible]     -0.42     -0.42      1.45      0.45
[illegible]      0.00      0.42      0.65      1.19
[illegible]      1.68      0.84     -0.65      1.19
[illegible]      0.84      0.42      0.16     -0.15
[illegible]      0.21      0.28      0.65      0.45
[illegible]      1.26      0.70      0.65      1.48
[illegible]      0.84      1.27      0.32      0.59
[illegible]      1.89      2.25      0.48      0.59
[illegible]     -0.21      0.00     -0.48     -0.89
[illegible]      0.00      0.00      0.65     -0.15
[illegible]     -0.21     -0.98     -1.78     -1.48
[illegible]      0.00     -0.98      0.97      1.19
[illegible]     -1.26      0.14     -0.16      0.15
[illegible]     -0.63     -0.28      1.13      0.00
[illegible]     -0.68     -0.42     -1.45     -1.78
[illegible]      1.47     -1.55     -1.29     -0.74
Washington      -1.6[?]   -2.07     -1.[?]    -1.68














TABLE IIA. CORRELATIONS OF WYLIE WITH PARK-FRANZEN¹

Total of Park-Franzen (Kindergarten A and B) with total of 
  Wylie (all age groups) (20 schools)                               .79 
Park-Franzen Kindergarten B with Wylie 10-year-olds (24 schools)    .70 


TABLE IIB. CORRELATIONS OF WYLIE WITH WOODY-McCALL²

Total of Woody-McCall (Grades IVa, Va and VIa) with total 
  of Wylie (all age groups) (21 schools)                            .73 
Total of Woody-McCall (Grades IVa, Va and VIa) with total 
  of Wylie (all age groups) (12 schools)                            .76 
Woody-McCall Grade Vb with Wylie Grade Vb (17 schools)              .58 
Woody-McCall Grade VIIb with Wylie Grade VIIb (15 schools)          .57 





1 Incidentally this table is a genuine validation of the Park-Franzen Test. 

2 When grade-groups and age-groups are correlated, it must be remembered that 
differences in age-grade status will lower these r’s. Washington and McKinley 
sixth grades have much older children in them than Hubbell and Elmwood sixth 
grades. It follows that the correlation between Woody-McCall and Wylie as 
well as the correlations in the following tables which involve both grade and 
age groups, would be higher if both variables had been computed for age groups. 
TABLE IIC. CORRELATIONS OF WYLIE WITH BUCKINGHAM-AYRES SPELLING

Total of Wylie (all age groups) with total of Spelling (Grades 
  Vb, Va, VIIb and VIIa) (17 schools)                               .69 
Total of Wylie (all age groups) with total of Spelling (Grades 
  Vb, Va, VIIb and VIIa) (12 schools)                               .67 
Wylie Grade Vb with Spelling Grade Vb (17 schools)                  .60 
Wylie Grade VIIb with Spelling Grade VIIb (15 schools)              .49 


TABLE IID. CORRELATIONS OF WYLIE WITH THORNDIKE HANDWRITING (QUALITY)

Total of Wylie (all age groups) with Handwriting averages 
  for Grades V and VI (27 schools)                                  .02 
Total of Wylie (all age groups) with total of Handwriting 
  (Grades IV, V and VI) (27 schools)                                .08 
Total of Wylie (all age groups) with total of Handwriting 
  (Grades IV, V and VI) (for 10 schools included in above 27 
  schools but having no spelling records)                          -.05 
Total of Wylie (all age groups) with total of Handwriting 
  (Grades IV, V and VI) (for 17 schools included in above 27 
  schools and having spelling records)                              .25 
Total of Wylie (all age groups) with total of Handwriting 
  (Grades IV, V and VI) (for 12 schools having spelling and 
  arithmetic records)                                               .18 
Wylie Grade Vb with Handwriting Grade V (17 schools)                .01 
Wylie Grade VIIb with Handwriting Grade VII (15 schools)           -.26 


TABLE III. RANKS FOR THE 11 SCHOOLS THAT HAD COMPLETE RECORDS FOR ALL TESTS

[The five column headings are illegible in the source.]

School               (1)     (2)     (3)     (4)     (5)
[illegible]          1.5     3       1       4       9
[illegible]          1.5     1       3       3       4
[illegible]          3       2       4       6.5     3
[illegible]          4       6.5     8       5      11
Wallace Whittier     5       4       5       8       6
[illegible]          6       5       2       2       2
[illegible]          7.5     6.5     9       9       7
[illegible]          7.5     9.5     6       1       5
[illegible]          9       8       7       6.5     1
[illegible]         10       9.5    10      10       8
[illegible]         11      11      11      11      10
TABLE IVA. INTERCORRELATIONS BETWEEN GRADE Vb RECORDS (GRADE V IN 
CASE OF HANDWRITING) 
(17 schools)

                Wylie    Arithmetic   Spelling
Arithmetic       .58
Spelling         .60       .11
Handwriting      .01       .005         .21


TABLE IVB. INTERCORRELATIONS BETWEEN GRADE VIIb RECORDS (GRADE VII 
IN CASE OF HANDWRITING) 
(15 schools)

                Wylie    Arithmetic   Spelling
Arithmetic       .57
Spelling         .49       .31
Handwriting     -.26       .10         -.24


TABLE IVC. INTERCORRELATIONS BETWEEN TOTALS

Wylie for All Ages 
Woody-McCall Grades IVa, Va, VIa 
Spelling Grades Vb, Va, VIIb, VIIa 
Handwriting Grades IV, V, VI 

                Wylie    Arithmetic   Spelling
Arithmetic       .76
Spelling         .67       .66
Handwriting      .18       .33          .26
TABLE IVD. PARTIAL CORRELATIONS WITH HANDWRITING CONSTANT

                               Woody-McCall    Spelling
                               with Wylie      with Wylie
Grade V data (17 schools)          .58            .59
Grade VII data (15 schools)        .62            .46
Total data (12 schools)            .75            .66


TABLE IVE. PARTIAL CORRELATIONS WITH WYLIE CONSTANT

                               Woody-McCall        Spelling with
                               with Handwriting    Handwriting
Grade V data (17 schools)         -.01                 .25
Grade VII data (15 schools)        .31                -.13
Total data (12 schools)            .30                 .18





TABLE V. STANDARD DEVIATIONS OF VARIABLES USED IN INTERCORRELATIONS

[The seven column headings are illegible in the source; values are given in the order printed, and the column position of entries after a gap is uncertain.]

Wylie            12.18    6.42    50.89    52.67    49.28    47.35    52.00
Woody-McCall      3.40    2.82    [illegible]                          7.10
Spelling          4.62    5.35    18.21    19.83
Handwriting       4.17    4.80     9.87    [illegible]                12.83
Park-Franzen      [illegible]                                         12.24


























“Total” means for the 
Wylie: All age groups 
Woody-McCall: Grades IVa, Va and VIa 
Spelling: Grades Vb, Va, VIIb, VIIa 
Handwriting: Grades IV, V, VI 
Park-Franzen: Kindergarten A and B 











A NOTE ON THE MEASUREMENT OF 
MOTOR ABILITY 


KARL M. COWDERY* 


Stanford University 


I 


In a study entitled “The Measurement of Motor Ability”* Dr. 
Evelyn Garfiel raises the question of possible definition and measure- 
ment of motor ability, and suggests a relationship between this ability 
and general intelligence. The significant feature of the study is the 
use of an independent criterion as the basis for the definition of that 
which she tries to measure, and for the validation of her tests. Her 
method and plan of treatment are, indeed, praiseworthy, but the data 
seem to the writer to merit further statistical analysis. 

The criterion used was a single set of ratings, arrived at by committee 
consultation, by gymnasium teachers of the “motor ability” of their 
students, sophomore college women. The method undoubtedly 
results in a graded judgment of a special combination of control and 
direction of physical, nervous, and coordinated energies and activities 
which may be a general motor ability, or may possibly be a more 
specific gymnastic ability. Since the committee consisted entirely 
of gymnasium teachers, the likelihood is that their point of view limits 
the “general” nature of the ability measured. 

Following the suggestion of her reading and the results of pre- 
liminary experimentation, Miss Garfiel chooses for final and main study 
tests to investigate five aspects of motor ability: (a) Speed of voluntary 
movement, (b) accuracy of voluntary movement, (c) control of involuntary 
movement, (d) strength, and (e) motor adaptability. 

Contrary to the indications of her preliminary testing the final 
battery does not include a test for accuracy of voluntary movement. 
This omission is hardly justifiable in view of the fact that in the 
preliminary experiments the 3-Hole Aiming Test appears as one of a 
small group which correlated most highly with her criterion. The 
basis of selection for the final battery was avowedly definite correla- 
tion with the independent criterion and low intercorrelation between 
tests. Yet in the selected series of tests there were retained three 
strength tests whose coefficients of interrelationship ranged from .34 





* The writer wishes to acknowledge indebtedness to Professors L. M. Terman, 
T. L. Kelley, and W. R. Miles for helpful suggestions and criticisms. 
to .50, while the aiming test, which showed definite correlation with 
criterion (.19 as compared with .20 for leg strength) and no intercorre- 
lations in excess of .31, was omitted. 

Miss Garfiel states that her criterion has a reliability coefficient of 
.92. This was obtained by correlating a set of ratings of the ability 
of the subjects made by a committee of three judges with a similar 
set of ratings of the same subjects by the same judges after an interval 
of six weeks. A valid indication of the reliability of ratings is obtained, 
not by repeated judgments of the same subjects by the same judges, 
but by independent ratings of the same subjects by, at least supposedly, 
equally competent judges. It is unfortunate that conditions did not 
permit the three judges to give independent ratings of the subjects 
from which a “true” reliability might have been obtained. As 
explained in the study the fault was not with the experimenter but 
with the lack of familiarity with all the subjects on the part of the 
judges. Perhaps the invalid reliability coefficient might better have 
been omitted. 

For the optimum weighting of the elements of a test battery 
in securing a battery score for correlation with a criterion, Dr. T. L. 
Kelley‘ (pp. 279-95) has pointed out that partial regression coeffi- 
cients should be used. The writer questions whether the method 
of obtaining approximations to these weights as used by Miss Garfiel 
is sufficiently flexible to result in the best available coefficients for use 
in the regression equation. Further treatment of this point will 
follow. 

The writer also has the impression that the interpretation of a 
total coefficient of correlation (having a relatively high probable error), 
between the criterion and general intelligence as measured by Army 
Alpha, has led to an imperfect conception of the relationship existing 
between the “motor ability” and intelligence. 

One other minor correction is to be noted. Miss Garfiel has 
summarized the literature on the relations of motor abilities to general 
intelligence in a chart. In this summary she credits Binet and Vas- 
chide with finding ‘‘no correlation.’? However, in the reference cited 
by Miss Garfiel? no attempt was made to measure the strength of 
relationship of motor abilities and intelligence. The writers found and 
report a lack of intercorrelation between certain motor tests. In 
another article’ these same writers report a low but definitely positive 
relationship between the results of several motor tests and teachers’ 
ratings of intelligence. 
Miss Garfiel’s main contribution lies in the fact that it suggests a 
profitable mode of attack upon the problem of definition and measure- 
ment of motor ability. She has seen the need and sought to obtain 
validation of her method of measurement by means of an outside 
criterion. 

II 


The writer desired to obtain just such a scale as Miss Garfiel 
attempted to build, together with further evidence as to the validity 
of the five aspects of motor ability and as to the relationship existing 
between motor ability and general intelligence. With this purpose 
in mind he has worked over some of the data presented in her article. 

From the table presenting the intercorrelations of the 16 tests 
of Miss Garfiel’s main experiment a selection has been made of 10 
tests. The basis of the choice is definite correlation with the criterion 
and the lowest possible average intercorrelation between tests. Nine 
tests were so selected. In spite of its low correlation with the criterion, 
Army Alpha is also included in the list. This is done for the purpose 
of obtaining an index of the relative place of general intelligence in 
the criterion. Table I gives the chosen tests, the correlation of each 
with the criterion, and their intercorrelations. 


TABLE I. CORRELATIONS WITH CRITERION AND INTERCORRELATIONS OF SELECTED [TESTS]

[Column headings are lost in the source. Columns 1 to 10 correspond to the numbered tests; the last column is the average intercorrelation.]

                        1     2     3     4     5     6     7     8     9    10    Av.
 0 Criterion          .63   .44   .29   .25   .23   .22   .22  -.19   .19   .02
 1 Running                  .23   .17   .16   .23   .14   .26  -.02   .15  -.17   .17
 2 Paper trick                    .11   .13   .09  -.08   .08  -.17  -.11  -.33   .15
 3 [illegible]                         -.07   .33  -.02   .04  -.12  -.11   .32   .14
 4 Hand strength                              .14   .20   .26   .02   .26   .06   .14
 5 Foot speed                                       .20   .15   .06   .03   .06   .14
 6 [illegible]                                            .14   .08   .31  -.12   .14
 7 Chest strength                                               [entries illegible]
 8 Steadiness                                                        -.09   .27   .10
 9 3-Hole aiming                                                           -.25   .15
10 Army alpha                                                                     .19
The selection on the above basis includes but one of the strength 
tests used by Miss Garfiel in her final series, and it includes one which 
she did not use. On this basis of selection one of her accuracy tests, the 
3-Hole Aiming Test, is also retained. Six of the tests are common to 
Miss Garfiel’s and to this selection, namely Running, Paper, Tricks, 
Hand Strength, Tapping, and Steadiness. Her other two strength 
tests are replaced by Foot Speed, Chest Strength, 3-Hole Aiming, and 
Alpha. It is worth noting that all five aspects of motor ability enumerated 
above are represented. 

The intercorrelations and the correlations with criterion are treated 
by Dr. Kelley’s method of successive approximations‘ (pp. 302ff.) 
which results in the optimum weighting of the various elements in 
the battery (partial coefficients of regression), the multiple correlation 
with criterion, the correlation of each test element with the weighted 
combination of the rest of the elements, and the correlation of the 
battery with the criterion when any one of the tests is omitted. 
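As an illustration of what the partial regression coefficients and the multiple correlation represent, the sketch below solves the normal equations directly for a reduced battery of only two tests, Running and Paper trick, using the values read from Table I (.63 and .44 with the criterion, .23 with each other). This direct solution is a substitute for, and not a reproduction of, Dr. Kelley's method of successive approximations; the function names are illustrative, and the two-test result is not comparable with the coefficients of the full battery.

```python
from math import sqrt

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("intercorrelation matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def beta_weights(intercorr, crit_corr):
    """Standard-score regression weights and multiple R for a test battery."""
    betas = solve_linear(intercorr, crit_corr)
    multiple_r = sqrt(sum(b * r for b, r in zip(betas, crit_corr)))
    return betas, multiple_r

# Two-test illustration: Running and Paper trick (values read from Table I).
intercorr = [[1.0, 0.23],
             [0.23, 1.0]]
crit_corr = [0.63, 0.44]
betas, multiple_r = beta_weights(intercorr, crit_corr)
```

For these two tests alone the weights come to roughly .56 and .31, with a multiple correlation near .70. Adding further tests to `intercorr` and `crit_corr` extends the same calculation to a larger battery.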

At the third approximation to the optimum weightings and to the 
multiple correlation two of the tests were found to have detrimental 
influence upon the correlation with criterion. For this reason the 
tests of chest strength and of foot speed were dropped from further 
consideration. Omitting the test for foot speed results in an increase 
of .003 in the multiple correlation coefficient; similarly exclusion of the 
test for chest strength adds .001 to the coefficient. Exclusion of these 
tests saves time in giving the battery and increases its validity. 

Continued treatment of the data for the eight remaining tests 
gives the coefficients listed in Table II, with a final coefficient of .81 
between the weighted score on the battery and the criterion. This is 
to be compared with a coefficient of .77 obtained by Miss Garfiel for 
her battery with her weightings. 

The inclusion of the accuracy test (3-Hole Aiming) is justified 
by the fact that the battery correlation with criterion is increased 
from .791 to .81 by its retention in the series. Its weight is nearly 
equal to that of the tapping test and eight times that for hand strength. 

One test for strength now takes the place of the three in the Garfiel 
battery. This one test contributes comparatively little to the effi- 
ciency of the series and might be omitted without seriously impairing 
the ability of the battery to measure up to the criterion. The hand 
strength test serves to raise the multiple coefficient only from .809 to 
.81. The writer prefers to continue to include some measurement 
of strength, as being one of the recognized aspects of motor ability, 
and as contributing even a slight increment to the score. It is conceivable 
that some other measure of strength might be better to measure 
the present criterion, or a criterion based upon some different 
point of view of the judges might call for a heavier weighting of hand 
strength. 


TABLE II. PARTIAL REGRESSION AND MULTIPLE CORRELATION COEFFICIENTS¹

                                                       Average
Tests              β0u·(c-u)   r0(c-u)   ru(c-u)   intercorrelation   Garfiel weights
Running              .522       .626      .203          .15              1.0000
Paper trick          .417       .690      .024          .17               .5591
Army alpha           .341       .742     -.377          .22
Steadiness          -.193       .785      .008          .11              -.2985
Tapping              .180       .789      .048          .14               .2376
3-Hole aiming        .164       .791      .024          .18
Tricks               .049       .808      .306          .12               .2832
Hand strength        .019       .809      .286          .13               .1537

Multiple Coefficient, Battery with Criterion, .8096 (Garfiel .77) 

1 EXPLANATORY NOTE: 
β0u·(c-u), Partial regression coefficient, weighting for standard scores in 
regression equation. 

r0(c-u), Coefficient of correlation between criterion and battery with this 
test omitted. 

Table value reads “If Running Test omitted, remainder of battery gives 
multiple correlation with criterion of .626.” 

ru(c-u), Coefficient of correlation of given test with weighted composite 
score of the rest of the battery. 

Similarly the “tricks” add comparatively little to the value of 
the battery, and, in case of lack of time available for the administra- 
tion of the tests, they might be omitted without serious detriment to 
the scale. The aspect of motor adaptability is probably adequately 
measured by the Paper Test and the reactions to the various other 
elements in the series. 

Reference to the figures of Table II, which gives the relative sizes 
of the weightings of standard scores in each of the battery elements, 
shows that Army Alpha contributes a relatively important part in 
measuring the ability represented by the teachers’ judgments of motor 
ability. For the range of talent considered (college sophomore 
women) the measure of general intelligence, after proper weighting, 
serves to raise the multiple coefficient of correlation with criterion 
from .742 (when Alpha is omitted) to .81. In order of size of partial 
regression coefficients this test stands third in the series. 

In view of this comparatively heavy weighting of the general
intelligence test scores the writer feels unable to agree unreservedly
with Miss Garfiel when she “ventures the guess that mental and
motor ability are different groups of abilities which tend to low positive
correlations of approximately .10 to .12 for adults in general.”
The figures of Table II indicate that the test for intelligence has added
.068 to the multiple coefficient while recognized tests of motor abilities,
such as tapping and steadiness, contribute increments of but .021
and .025 to this same coefficient. If the criterion used by Miss Garfiel
is a valid standard for the expression of “motor ability,” mental
ability seems to be more closely related to motor ability than Miss
Garfiel believes.
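The increments quoted in this paragraph can be checked by subtracting each "test omitted" multiple correlation of Table II from the full-battery coefficient of .8096. The following is a minimal sketch; the names `R_FULL`, `R_OMITTED` and `increment` are illustrative, and the value .789 for tapping is an assumption inferred from the .021 increment stated in the text, since the row label is illegible in the table.

```python
# Multiple correlation of the full battery with the criterion (Table II).
R_FULL = 0.8096

# Multiple correlation of the battery with each named test omitted.
# The tapping value is inferred from the increment quoted in the text.
R_OMITTED = {
    "Army Alpha": 0.742,
    "Steadiness": 0.785,
    "Tapping": 0.789,
}


def increment(test: str) -> float:
    """Gain in the multiple coefficient when `test` is added to the rest."""
    return round(R_FULL - R_OMITTED[test], 3)


increments = {name: increment(name) for name in R_OMITTED}
```

On these figures the intelligence test adds roughly three times as much to the multiple coefficient as either motor test, which is the comparison the paragraph relies on.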

In regard to the interrelationships between the intelligence test 
and the other members of the battery, it is noted that the battery with 
Alpha omitted (c — u in the notation of Table II) correlates .742 with 
the criterion. This remainder of the battery, made up of recognized 
motor tests, gives a negative coefficient (—.377) of correlation with 
Alpha. The criterion is apparently measured by tests of two sets of 
abilities whose interrelationships are expressed by negative coefficients. 
This suggests that the “motor ability” may be a group of abilities, part
of which are quite different in nature and distribution from general 
intelligence, but whose control and coordinated expression depend 
directly upon some such ability as that measured by tests of general 
intelligence. 


SUMMARY 


Miss Garfiel, in her study of the “Measurement of Motor Ability,”
has made a significant contribution to the method of investigation 
bearing upon the definition and measurement of motor ability. 

A new selection from the data presented by Miss Garfiel results 
in a battery of tests which gives a correlation of .81 with her criterion 
for motor or gymnastic ability among college sophomore women. 

Partial regression and multiple correlation coefficients justify 
the inclusion of tests for “speed of voluntary movement, accuracy of 
voluntary movement, control of involuntary movement, and motor 
adaptability.” There is some evidence that strength is to be included 
with these various aspects of motor ability. 
A comparatively important place is demonstrated for general
intelligence as a factor in the expression and control of motor ability. 
The interrelationships found raise definite doubt as to the existence 
of a distinct motor ability whose coordinated expression is relatively 
independent of general intelligence. 
A COMPARATIVE STUDY OF THE STANFORD AND
THE HERRING REVISIONS OF THE BINET-SIMON TESTS


CHARLES F. WILNER 
Bureau of Research, State Department of Institutions and Agencies, Trenton, N. J. 


The Herring Revision of the Binet-Simon Tests is an individual 
intelligence examination of the point scale type, consisting of 38 
tests subdivided into six groups, as follows: 


Group A, consisting of Tests 1 to 4 inclusive 
Group B, consisting of Tests 1 to 13 inclusive 
Group C, consisting of Tests 1 to 2 inclusive 
Group D, consisting of Tests 1 to 30 inclusive 
Group E, consisting of Tests 1 to 38 inclusive 
Group K, consisting of 16 non-reading tests 
selected from the 38 (Wilner, 1923). 

Mental ages comparable with those of the Stanford may be deter- 
mined on the basis of any group. 

The examinees included in this study may be divided into three 
groups: 

Group I.¹ Seventy-two cases from the public schools of Garden
City; from Letchworth Village Institution for the Feeble-minded;
and from a private school in Scarboro. The Herring examinations 
were given by Miss Grace Taylor, Miss Jessie LaSalle, and John P. 
Herring. These examinations were given while the tests of the 
Herring Revision were undergoing constant modification. 

Group II. Eighty-two cases from the Garden City public schools
and the Letchworth Village Institution. The Herring examinations
were given by Miss Grace Taylor and John P. Herring. These
examinations were given after the Herring Revision had assumed
practically final form.

¹ The writer wishes to acknowledge his indebtedness to the author of the
Herring Revision for the original data on the 154 cases of Groups I and II. These
were the cases used in the standardization of Groups A, D, and E of the Herring
Revision. Mental ages of these 154 cases used in the present study differ slightly
from those used by Herring (1924), because the scores used in the present study
include only those elements which were retained in the published edition of
the test. Since the same tables of mental age equivalents were used, this produces
a practically constant decrease in the Herring Mental Ages and hence affects the
correlations very slightly, if at all.

Acknowledgment is also due to Mr. Charles H. Fisher, Principal of the
Bloomsburg (Pennsylvania) State Normal School, whose support of the Bureau of
Educational Research maintained there from 1921 to 1923 made this study
possible.

Most of the Stanford examinations in the 154 cases above were 
given by Miss Taylor, Mr. Herring, and Raymond H. Franzen. The 
remaining few were given by graduate students of Teachers College, 
candidates for the doctorate in Educational Psychology. 

Group III. One hundred and sixteen cases from the public schools 
of Bloomsburg. This group was made up of children who on May 1,
1922 had chronological ages of 144 to, but not including, 156 months. 
Practically all of the 12-year-old children in the school system were 
included. The published form of the Herring Revision was used. 
Both the Stanford and the Herring examinations were given by 
Marjorie H. Wilner. In every case, the Stanford was given first, 
followed after an interval of not less than six days nor more than four 
months, by the Herring. 

The chronological ages of the examinees at the time the Herring 
Revision was given are shown in Table I. 











TABLE I

Chronological   First group   Second group   Groups I and II   Third group   Totals
age             72 cases      82 cases       154 cases         116 cases

  4                 1             ..               1               ..            1
  5                 1             ..               1               ..            1
  6                 6             ..               6               ..            6
  7                 5             ..               5               ..            5
  8                 9              1              10               ..           10
  9                 9              3              12               ..           12
 10                12             26              38               ..           38
 11                 7              9              16               ..           16
 12                10             23              33              106          139
 13                 7             10              17               10           27
 14                 2              5               7               ..            7
 15                 2              3               5               ..            5
 16                 0              1               1               ..            1
 17                 0              1               1               ..            1
 18                 1              0               1               ..            1
Totals             72             82             154              116          270
The following intercorrelations¹ were found:


TABLE II. MENTAL AGES, FIRST GROUP, 72 CASES

                           Herring Revision
              A        B        C        D        E        K     Stanford
A          1.0000    .9816    .9584    .9588    .9584    .8870    .9394
B           .9816   1.0000    .9734    .9701    .9692    .9261    .9503
C           .9584    .9734   1.0000    .9926    .9884    .9573    .9761
D           .9588    .9701    .9926   1.0000    .9954    .9523    .9744
E           .9584    .9692    .9884    .9954   1.0000    .9542    .9781
K           .8870    .9261    .9573    .9523    .9542   1.0000    .9417
Stanford    .9394    .9503    .9761    .9744    .9781    .9417   1.0000

TABLE III. MENTAL AGES, SECOND GROUP, 82 CASES

                           Herring Revision
              A        B        C        D        E        K     Stanford
A          1.0000    .9746    .9535    .9549    .9520    .9165    .9444
B           .9746   1.0000    .9852    .9861    .9837    .9584    .9767
C           .9535    .9852   1.0000    .9918    .9875    .9671    .9794
D           .9549    .9861    .9918   1.0000    .9953    .9663    .9866
E           .9520    .9837    .9875    .9953   1.0000    .9642    .9877
K           .9165    .9584    .9671    .9663    .9642   1.0000    .9607
Stanford    .9444    .9767    .9794    .9866    .9877    .9607   1.0000





























¹ Using the method described by Toops (1922). These computations are
the work of Miss Ruth Terry and Mr. Stephen Lerda.
TABLE IV. MENTAL AGES, GROUPS I AND II, 154 CASES

                           Herring Revision
              A        B        C        D        E        K     Stanford
A          1.0000    .9757    .9518    .9518    .9526    .9008    .9419
B           .9757   1.0000    .9785    .9801    .9757    .9448    .9636
C           .9518    .9785   1.0000    .9906    .9883    .9634    .9766
D           .9518    .9801    .9906   1.0000    .9953    .9615    .9804
E           .9526    .9757    .9883    .9953   1.0000    .9602    .9843
K           .9008    .9448    .9634    .9615    .9602   1.0000    .9513
Stanford    .9419    .9636    .9766    .9804    .9843    .9513   1.0000
TABLE V. MENTAL AGES, THIRD GROUP, 116 CASES

                                 Herring Revision                                    Short
                    A        B        C        D        E        K     Stanford   Stanford
A                1.0000    .9334    .8777    .8254    .7982    .6801    .7810      .7729
B                 .9334   1.0000    .9515    .9212    .8964    .8194    .8852      .8744
C                 .8777    .9515   1.0000    .9654    .9443    .8639    .9283      .9110
D                 .8254    .9212    .9654   1.0000    .9785    .8958    .9599      .9465
E                 .7982    .8964    .9443    .9785   1.0000    .9035    .9769      .9577
K                 .6801    .8194    .8639    .8958    .9035   1.0000    .8888      .9199
Stanford          .7810    .8852    .9283    .9599    .9769    .8888   1.0000      .9736
Short Stanford    .7729    .8744    .9110    .9465    .9577    .9199    .9736     1.0000
TABLE VI. MENTAL AGES, GROUPS I, II AND III, 270 CASES

                           Herring Revision
              A        B        C        D        E        K     Stanford
A          1.0000    .9880    .9535    .9506    .9373    .8779    .9212
B           .9880   1.0000    .9875    .9427    .9750    .9280    .9617
C           .9535    .9875   1.0000    .9886    .9806    .9496    .9710
D           .9506    .9427    .9886   1.0000    .9923    .9527    .9805
E           .9373    .9750    .9806    .9923   1.0000    .9396    .9845
K           .8779    .9280    .9496    .9527    .9396   1.0000    .9510
Stanford    .9212    .9617    .9710    .9805    .9845    .9510   1.0000





TABLE VII. MEAN MENTAL AGES OF EACH GROUP OF EXAMINEES

                                        A        B        C        D        E        K     Stanford
Group I (72 cases)..................  104.97    99.71   100.00   100.38   100.39   104.14   103.76
Group II (82 cases).................  115.41   108.75   111.48   112.40   112.01   117.75   116.90
Groups I and II (154 cases).........  110.53   104.52   106.11   106.77   106.57   111.38   110.75
Group III (116 cases)...............  157.47   141.96   142.95   144.93   147.71   146.19   147.87
Groups I, II and III (270 cases)....  130.69   120.60   121.94   123.17   124.24   126.34   126.70
TABLE VIII. MEAN MENTAL AGES BY CHRONOLOGICAL AGE GROUPS

Chronological age (months) | Number of cases | Mean MA’s: A | B | C | D | E | K | Stanford | Mean CA
48 1 77.0 | 75.0 | 75.0 | 73.0 | 77.0 | 76.0 | 53.0 
60 1 80.0 | 79.0 | 78.0 | 76. 78.0 | 70.0 | 65.0 
72 6 86.67) 87. 33) 88.0 | 87. .5 | 90.0 | 78.83 
84 5 83.4 | 84. 2 | 85.2 | 84. 8 | 85.4 | 89.4 
96 10 98.2 | 98.5 | 97.8 | 96. .4 | 99.2 100.9 
108 12 110.4 {111.7 113. 2 114. .33 119.83,112 .66 
120 38 115.47/118.82 119. 26, 118. .39 124.55 125.95 
132 16 89.8 | 87.5 | 88. 25) 87. .25| 90.25/135.31 
144 139 136.05/137.85 140.04/142. .9 |143.84 149.14 
156 27 118. [118.9 |124.9 |121. 1 |125.2 |158.5 
168 7 82.7 | 79.0 | 79.5 | 78. 3 | 81.2 172.7 
180 5 82.8 | 80.4 | 80.5 | 79. .4 | 85.4 |186.2 
192 1 81.0 | 80.0 82.0 | 80. .0 | 79.0 197.0 
204 1 84.0 | 86.0 86.0 | 86. .0 | 92.0 |206.0 
216 1 95.0 | 85.0 | 82.0 | 81. .0 | 84.0 216.0 
Paes) 
270 








_- 
TABLE IX. STANFORD MENTAL AGE GROUPS

Stanford    Mean       Number            Mean Herring mental ages                  Mean
MA          Stanford   of cases      A       B       C       D       E       K      CA
            MA
60-71         69.6         3       78.0    78.3    77.7    77.7    76.7    78.0    78.0
72-83         79.5        40       80.5    79.9    77.2    77.9    76.65   80.7   113.85
84-95         89.1        37       89.0    86.7    85.1    83.2    84.6    87.6   145.2
96-107       102.0         9      101.2    96.5    96.2    96.2    96.2    99.4   125.4
108-119      113.3        21      123.2   111.7   111.8   111.4   107.9   115.3   131.6
120-131      126.2        27      132.8   124.5   124.9   123.1   123.5   126.7   136.5
132-143      137.2        44      145.3   128.2   133.6   134.9   133.3   136.2   141.2
144-155      153.5        35      152.3   139.6   143.8   145.4   147.2   151.5   144.03
156-167      159.9        17      162.4   146.3   151.7   158.05  160.9   155.9   151.7
168-179      172.3        18      170.2   154.5   161.8   166.6   171.6   170.8   146.7
180-191      202.8        11      182.8   164.0   170.1   176.9   179.09  179.5   147.2
192-203      199.0         3      196.0   181.0   180.3   189.3   139.6   184.0   151.6
204-215      209.0         2      198.0   183.5   187.5   197.0   205.5   202.0   152.0
216-227      217.0         2      222.0   198.5   195.0   205.5   214.0   211.0   149.5
228-239      231.0         1      [entries illegible except 230.0]
Total                    270








TABLE X.—STANDARD DEVIATIONS OF MENTAL AGES 





Examinees 


Herring 








Group I (72 cases)...... 
Group II (82 cases)..... 
Groups I and II (154 cases)..... 33. 19'26.94 
Group III (116 cases)... 
Groups I, II and III (270 cases). 





| 


32.05 26.38 


24.91/20 .87 
37 2 .69 


| 





25.94 
30.93 
21.25 
32.75 





Stanford 





| | 
ea: 1d 33.20/29 .66 29.33 30.3631.99 31.51 
31,.22'32.27134.45 
31 .44'32.66/33.94 
23 .6925.22'26.04 
‘ie Wake iene 








32.32 
33 .00 
33.45 
25.11 
35.32 
TABLE XI. VALIDITY AND RELIABILITY COEFFICIENTS BASED ON GROUP III EXAMINEES, 116 CASES

Herring
Group    $\sigma$   $r_S$    $r_E$    $k_S$    $k_E$   $\sigma_S\sqrt{1-r^2}$  $\sigma_E\sqrt{1-r^2}$  $PE_S\sqrt{1-r^2}$  $PE_E\sqrt{1-r^2}$
A        24.91    .7810    .7982    .6245    .6024    15.6812    15.1925    10.5768    10.2472
B        20.87    .8852    .8964    .4652    .4432    11.6812    11.1775     7.8789     7.5391
C        21.25    .9283    .9443    .3718    .3291     9.3359     8.2999     6.2970     5.5982
D        23.69    .9599    .9785    .2803    .2062     7.0383     5.2004     4.7473     3.5076
E¹       25.22    .9769   1.0000    .2137    .0000     5.3660      .0000     3.6193      .0000
K        26.04    .8888    .9035    .4583    .4286    11.5079    10.8093     7.7620     7.2908
































$\sigma$ is the standard deviation of the mental ages.
$r_S$ is correlation with the Stanford.
$r_E$ is correlation with Group E of the Herring.
$k_S$ is coefficient of alienation, each group with Stanford.
$k_E$ is coefficient of alienation, each group with Herring E.
$\sigma_S\sqrt{1-r^2}$ is the standard error of estimate, predicting the Stanford from each Herring Group.
$\sigma_E\sqrt{1-r^2}$ is the standard error of estimate, predicting Herring Group E from each of the other Groups.
$PE_S\sqrt{1-r^2}$ and $PE_E\sqrt{1-r^2}$ are the probable errors of estimate of the Stanford and of Herring
E respectively ($\sigma\sqrt{1-r^2} \times .6744898$).

¹ Some of the coefficients for Group E differ from those quoted by Herring (1924) because
these are based on $r_{SE}$ of .9769, data grouped as required by Toops’ method, while his are based
on $r_{SE}$ of .9870, for which class interval was one mental month.
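The derived columns of Table XI follow from the correlation and the standard deviation of the predicted test alone. The sketch below recomputes the Group A row against the Stanford; the function names are illustrative, and 25.11 is taken as the Stanford standard deviation for Group III, which is an assumption drawn from the figures in this study.

```python
import math

PE_FACTOR = 0.6744898  # probable error = 0.6744898 x standard error


def alienation(r: float) -> float:
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


def standard_error_of_estimate(sigma: float, r: float) -> float:
    """Standard error in predicting a test whose SD is `sigma`."""
    return sigma * alienation(r)


def probable_error_of_estimate(sigma: float, r: float) -> float:
    """Probable error corresponding to the standard error of estimate."""
    return PE_FACTOR * standard_error_of_estimate(sigma, r)


# Herring Group A versus the Stanford, Group III (116 cases).
SIGMA_STANFORD = 25.11  # assumed Stanford SD for Group III
k_a = alienation(0.7810)
se_a = standard_error_of_estimate(SIGMA_STANFORD, 0.7810)
pe_a = probable_error_of_estimate(SIGMA_STANFORD, 0.7810)
```

The results agree with the tabled 15.6812 and 10.5768 to within rounding, since the table appears to multiply by k already rounded to four places.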


CONCLUSIONS 


1. Both the Stanford and Group E of the Herring gave a very
reliable estimate of mental age. In the three groups of examinees
considered the lowest E/Stanford correlation was .9769. The same
data grouped in class intervals of one mental month give .9870. From
this correlation, using the Brown-Spearman formula, we get a reliability
for the Herring and Stanford combined of .9935 (Herring, 1923),
for which k is .1138.

The probable error of estimate of the combined tests in predicting
the result of a similar combination of equal reliability is 1.9274
mental months. The probable error of estimate of a true score from
the combined tests is 1.3579 mental months (Formula 169, Kelley,
1923). In computing the probable error of a true score, $\sigma$ of the
combined scores was taken to be 25.165, the average of $\sigma_E$ and $\sigma_S$.
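The step from the E/Stanford correlation of .9870 to the combined reliability of .9935 is the Brown-Spearman formula for two comparable tests taken together. A minimal sketch, with illustrative function names:

```python
import math


def spearman_brown_doubled(r: float) -> float:
    """Reliability of two comparable tests combined, given their correlation r."""
    return 2.0 * r / (1.0 + r)


def alienation(r: float) -> float:
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


r_combined = round(spearman_brown_doubled(0.9870), 4)
# k is computed from the four-place value, as the text appears to do.
k_combined = round(alienation(r_combined), 4)
```

Rounding the combined reliability to four places before computing k reproduces the .1138 given in the text.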

2. The data do not warrant any positive statements concerning
validity or concerning identity with the Stanford. Kelley (letter
to Dr. John P. Herring, March, 1924) states as a criterion of the
identity of two tests a correlation, corrected for attenuation, of 1 or
very close to 1 as judged by the probable error. The data do not,
however, readily permit the use of Kelley’s formula for correction for
attenuation, but the Stanford-Herring E correlation of .9870 suggests
that the two examinations are nearly identical.

3. Group D of the Herring is only slightly less reliable than Group
E, as the following table shows:

         $r_S$     $k_S$    $\sigma_S\sqrt{1-r^2}$   $PE_S\sqrt{1-r^2}$   $\sigma_E\sqrt{1-r^2}$   $PE_E\sqrt{1-r^2}$
D.....   .9599    .2803     7.0383     4.7473     5.2004     3.5076
E.....   .9769    .2137     5.3660     3.6193

For most purposes nothing less than Group D should be used, and 
whenever possible Group E. 

4. Groups A, B, and C of the Herring should seldom be used alone,
although Avery (1924) finds a C/Stanford correlation of .824 ± .0312
and an E/Stanford correlation of .787 ± .037, the examinees being
48 Grade I children. Interpretation of Avery’s correlations is
rendered difficult by his failure to give the SD of the group. An r
of .824 when $\sigma$ is 12 mental months, for example, is equivalent to
an r of .959 in a group of which $\sigma$ is 25 mental months.
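The equivalence just stated can be reproduced by the usual correction of a reliability coefficient for range, here assumed to take the form R = 1 − (σ/Σ)²(1 − r), where r and σ belong to the narrow group and R and Σ to the wide one. The function name and argument names below are illustrative.

```python
def correct_for_range(r: float, sigma: float, big_sigma: float) -> float:
    """Correlation expected in a group of SD `big_sigma`,
    given r observed in a group of SD `sigma`."""
    ratio_squared = (sigma / big_sigma) ** 2
    return 1.0 - ratio_squared * (1.0 - r)


# An r of .824 in a group with an SD of 12 mental months,
# restated for a group with an SD of 25 mental months.
r_wide = round(correct_for_range(0.824, 12.0, 25.0), 3)
```

When the two standard deviations are equal the function returns r unchanged, which is a quick check that the correction is being applied in the intended direction.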

5. The correlations between the shorter Groups of the Herring, 
especially Group A, and the Stanford in the 154 cases are so much 
higher than the correlations in the 116 cases that the larger o of the 
154 cases does not alone explain these differences. Since most of the 
116 cases fall in the upper range of these groups where there is a smaller 
amount of test material (in Group A, for example, 1 point in score may 
make a difference of 6 months of mental age) it is probable that errors 
of the type described by Cobb (1922) and Wilner (1923) are introduced. 

6. The data do not support conclusions concerning the reliability 
of Group K, because 

(a) All cases here presented were used in the standardization of 
this Group. 

(b) The tests in Group K were selected as measuring abilities somewhat
different from those measured by the other tests.

(c) Group K was planned for use with problem cases, while most 
of these examinees were normal children. 

7. The number of examinees with Stanford MA’s less than 6
years (three) is too small a group on which to base conclusions. They
do, however, tend to support Avery’s statement that the Herring tests
give too high a rating at this level. Until more children on this level
have been given the two examinations it will not be possible to state
whether this apparent difference is due to chance, to inadequacy of
the norms at this point, or to the test material itself. Further analysis
of Avery’s data may supply a tentative answer.
THE RELIABILITY OF ACCOMPLISHMENT
DIFFERENCES¹


JOHN P. HERRING 


Director, Bureau of Research 
New Jersey Department of Institutions and Agencies 


Accomplishment differences are differences between intelligence
and educational estimates. They are illustrated by the McCall
$F = T_e - T_i$ = educational T-score minus mental T-score.

The reliability of such differences varies 

1. Directly with the reliability of the tests employed, and 

2. Inversely with the correlation between intelligence scores and 
achievement scores. 

These two relationships are implied in Chapman’s (1923, unr.)
formula² for the reliability of accomplishment differences

$$r_{a_1a_2} = \frac{\dfrac{r_{iI} + r_{sS}}{2} - r_{IS}}{1 - r_{IS}}$$


Examine the two statements. 

1. Accomplishment differences are comparatively reliable when 
the tests employed are comparatively reliable. 

This is, of course, to be expected; assuming the formula, the truth 
of the statement is seen in Table I; and inspection of the formula 
shows the statement to be an algebraical relation therein. 

¹ Prepared under the auspices of the Division of Education and Classification,
William J. Ellis, Director.
² Chapman’s symbols throughout.

A reliability correlation based upon two forms of the same test
naturally and probably tends to be higher than the correlation between
two different tests of presumably the same thing but of different authorship.
The latter correlation, which is the more likely to be a mixture
of validity and reliability, may not properly be reported as a reliability
coefficient. The problem of validity is irrelevant and should
be the topic of a separate study. Illustrate this with two tests in
arithmetic, one of which measures a sufficiently wide variety of
arithmetical processes while the other is too narrow, as for instance
with the Stanford Achievement Tests in Arithmetic on the one hand
and the New Jersey Composite Arithmetic Tests on the other. It is
not, a priori, wholly fair to call the correlation between the National
Intelligence Tests and the Pressey Intelligence Tests a reliability
coefficient!¹

2. Accomplishment differences are comparatively reliable when 
the correlation between achievement and capacity is compara- 
tively low. 

This is reasonable, since the significance of a standard error of a 
measure is revealed through its ratio to the measure; it is confirmed 
by an inspection of Table I; and it is seen to be an algebraic relation 
in the formula. 

The situation is seen more in detail in Table I. 


TABLE I. RELIABILITY r’s AND k’s FOR VARIOUS RELIABILITIES OF TESTS USED,
AND VARIOUS DEGREES OF CORRELATION BETWEEN INTELLIGENCE AND
ACHIEVEMENT TESTS

$r_{a_1a_2} = \dfrac{\frac{r_{iI} + r_{sS}}{2} - r_{IS}}{1 - r_{IS}}$

Column                        I             II            III            IV             V
$\frac{r_{iI} + r_{sS}}{2}$          .96           .935           .89           .85           .55

$r_{IS}$        r     k       r     k       r     k       r     k       r     k      $r_{IS}$
.90       .60   .80     .35   .94     ...   ...     ...   ...     ...   ...     .90
.80       .80   .60     .68   .73     .45   .89     .25   .97     ...   ...     .80
.70       .87   .49     .78   .63     .63   .78     .50   .87     ...   ...     .70
.60       .90   .44     .84   .54     .73   .68     .63   .78     ...   ...     .60
.50       .92   .39     .87   .49     .78   .63     .70   .71     .10   .99     .50
.40       .93   .37     .89   .46     .82   .57     .75   .66     .25   .97     .40
.30       .94   .34     .91   .41     .84   .54     .79   .61     .36   .93     .30
.20       .95   .31     .92   .39     .86   .51     .81   .59     .44   .90     .20

In each column r is $r_{a_1a_2}$ and k is $k_{a_1a_2}$.






































$a_1$ and $a_2$ are accomplishment differences.

i and I are alternative forms of an intelligence test.

s and S are alternative forms of a school achievement test.
k is $\sqrt{1 - r^2}$, the coefficient of alienation.
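Any cell of Table I can be recomputed from the two test reliabilities and the correlation between intelligence and achievement. The sketch below assumes the simplified form of Chapman's formula that the table values imply, namely the mean reliability less $r_{IS}$, divided by 1 − $r_{IS}$; the function names are illustrative.

```python
import math


def accomplishment_difference_reliability(r_ii: float, r_ss: float, r_is: float) -> float:
    """Reliability of accomplishment differences.

    r_ii: reliability of the intelligence test (two forms)
    r_ss: reliability of the school achievement test (two forms)
    r_is: correlation between intelligence and achievement
    """
    mean_reliability = (r_ii + r_ss) / 2.0
    return (mean_reliability - r_is) / (1.0 - r_is)


def alienation(r: float) -> float:
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


# Column I of Table I (mean reliability .96) at r_IS = .80.
r_col1 = round(accomplishment_difference_reliability(0.98, 0.94, 0.80), 2)
k_col1 = round(alienation(r_col1), 2)
```

The same call with reliabilities of .987 and .982 and $r_{IS}$ of .50 gives about .969, so the sketch also covers the case of very reliable tests.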





1 The considerations urged by Chapman are fair as a support for the ideal of 
valid selection of tests, and unfair as material argument against the reliability in 
general of accomplishment differences. 
The coefficient of alienation, k, is used to interpret r. The magnitude
of the r’s in Table I, when used for prediction, can be interpreted
by means of the following Table II.


TABLE II. INTERPRETATION OF CORRELATION WHEN USED FOR PREDICTION

                              k             r
First ninth................ .00-.11     .994-1.00    Extremely high
Second ninth............... .11-.22     .976-.994    Very high
Third ninth................ .22-.33     .944-.976    High
Fourth ninth............... .33-.44     .898-.944    High average
Fifth ninth................ .44-.56     .829-.898    Average
Sixth ninth................ .56-.67     .74-.829     Low average
Seventh ninth.............. .67-.78     .63-.74      Low
Eighth ninth............... .78-.89     .46-.63      Very low
Ninth ninth................ .89-1.00    .00-.46      Extremely low

k = $\sqrt{1 - r^2}$ = coefficient of alienation (Kelley, 1923, p. 173).
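Table II amounts to a lookup on k. The following sketch converts a correlation to k and returns the verbal label; the names `K_BANDS` and `interpret` are illustrative, and assigning a value that falls exactly on a boundary to the more favorable band is an assumption, since the table does not say which band owns its endpoints.

```python
import math

# Upper k limit of each ninth, paired with the label from Table II.
K_BANDS = [
    (0.11, "Extremely high"),
    (0.22, "Very high"),
    (0.33, "High"),
    (0.44, "High average"),
    (0.56, "Average"),
    (0.67, "Low average"),
    (0.78, "Low"),
    (0.89, "Very low"),
    (1.00, "Extremely low"),
]


def interpret(r: float) -> str:
    """Verbal interpretation of a correlation r used for prediction."""
    k = math.sqrt(1.0 - r * r)
    for upper, label in K_BANDS:
        if k <= upper:
            return label
    return K_BANDS[-1][1]
```

For example, `interpret(0.987)` falls in the second ninth and `interpret(0.55)` in the eighth.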


Column I of Table I represents for the present very good working
conditions. When the average reliability is .96, whether this is
$\frac{.98 + .94}{2}$ or $\frac{.97 + .95}{2}$ or any other average equal to .96, we see from
inspection of r and k that the measures are precise enough to afford
genuinely educative control of the individual. The limitations of 
accomplishment ratios as set forth by Toops and Symonds (1923) and 
of accomplishment differences by Chapman (1923) have freed these 
devices from certain misinterpretations and misuses and have left 
them, when rightly employed, diagnostic and remedial instruments 
at once discriminative and powerful. It is already occasionally 
possible to do routine measuring with an average reliability of .96, as 
for instance by means of the Binet-Simon Tests and some of the Stan- 
ford Achievement Tests (Herring, 1924 and Kelley, 1923, Standard). 

Columns II, III and IV of Table I represent conditions of somewhat 
more frequent occurrence. With average reliabilities of .935, of .89, 
and of .85, r and k indicate less and less precision of control—yet the 
situation with .85 is still superior to the results indicated in Chapman’s 
article in which the $r_{a_1a_2}$ was not far from zero. It is fair to inquire
whether it is not now possible to do 80 per cent of our routine measure- 
ment in the elementary school with reliabilities averaging at least 
.85, and to leave him who selects inferior tests to justify his way. 
The query is conservatively stated, for in my judgment .90 is usu- 
ally possible. 

Column V of Table I represents work inexcusably inaccurate in
almost any situation. By the time we reach .55, a casual inspection
of r and k indicates that at this level our tests leave us in a condition
of, say, 10 per cent vision in one educational eye and none in the other!
Here only the grossest differences are with difficulty made out at all,
and their magnitude is a matter of almost entire uncertainty. We
may perhaps judge that only when average r is at least .85 and k is
.53 are the data fairly usable for accomplishment differences.

Analyze two of Chapman’s statements. 

1. He says: ‘‘We will assume, as is reasonable, that the true 
correlation of the ideal intelligence test and the ideal school test is .7.’’ 

This assumption is by no means in accord with that made by 
Franzen (1921 and 1922) that such correlations vary with total educa- 
tional efficiency over a range, say, from .10 to .80—an assumption at 
least tentatively verified by him experimentally through increasing 
school efficiency and then re-measuring. Herring has raised such 
an r from .35 to .75,¹ probably through motivation, in a class of 90
teachers studying educational measurement. The fact that this 
was done in about three weeks confirms his impression that accom- 
plishment differences respond sensitively to control. Moreover, 
Franzen orally suggested, it is precisely when the differences between 
ability and achievement are large, and the correlation between them 
low, that we wish to be especially certain of their existence and magni- 
tude. When they are small, and when the correlation between them 
is high, the hypothesis is that the school is then working efficiently 
in the particulars measured, and we may have less concern over the 
unreliability of small individual differences between intelligence and 
achievement scores, provided we are sure they are small, i.e., provided
r is high and $\sigma$ is low,² or, rigorously, provided the average difference is
small. Now the Chapman Formula behaves just as we should wish 
it to do, and just as measures of reliability in general do, giving us 
our greater certainties when the differences are large and the correla- 
tion is low. 

2. He also says: “Such facts as are presented above must be 
recognized by those who propose to determine the difference within a 
single grade of intellectual and school achievements when measured 
by such instruments as are at present available.” 

¹ For which $r_{a_1a_2}$ = about .87 and .84 respectively. The assertion has the
unfortunate limitation that no control group existed.
² High r and low $\sigma$ are not rigorous criteria.

This statement suggests the assumption that the tests with which
he entered his formula are the best available and the implication that
accomplishment differences have, in general, extremely low reliability.
As for both assumption and implication, enter the formula with .987
for the Stanford Revision of the Binet-Simon Tests and .92 for
separate Stanford Achievement Tests, e.g., tests in paragraph reading.
Inquire, also, what is the best we can now do? The best estimate
Herring (1924) has thus far been able to make of the reliability of the
Stanford or Herring Revision of the Binet-Simon Tests when administered
by an examiner carefully trained by him is r = .987.¹ The
reliability of average educational ages of the Stanford Achievement
Tests is reported by its authors as .982 (Kelley, 1921, stan.). The
formula yields Table III.


TABLE III. RELIABILITY OF ACCOMPLISHMENT DIFFERENCES BASED UPON
THE STANFORD OR HERRING REVISION OF THE BINET-SIMON TESTS AND
UPON THE STANFORD ACHIEVEMENT TESTS

$r_{IS}$      $r_{a_1a_2}$     $k_{a_1a_2}$

.70       .948       .318
.50       .969       .247
.30       .978       .209


By inspection of Tables I and III it appears that when we use
very high reliabilities, $r_{IS}$ matters relatively little. This situation is
ground for a degree of optimism regarding accomplishment differences.





¹ Other constants of the data are n = 116, $\beta_1$ = 0.1557 ± 0.1303, $\beta_2$ = 3.2126
± 0.5122, $\sigma_{MA}$ = 25 mental months. The group is an unselected 12-year-old age
group.

Avery (1924) reports lower correlations found with 48 Grade I children of
Palo Alto. Since constants were not reported in age groups and standard deviations
were not presented, the coefficients are difficult to interpret. This is because
of the well known fact that correlations decrease spuriously in magnitude with
decrease in dispersion of measures. In theory, r = 1.00 can thus be reduced to
r = 0. In practice, radical reductions frequently occur. In the more homogeneously
classified schools, which are presumably increasing in number, and which are
perhaps illustrated in Palo Alto, the reductions are the greater. We may expect
further reports of r between Herring and Stanford radically lower than .987, in part because
$\sigma_{MA}$ of the data reported is less than 25 months.

It is accordingly suggested that when such coefficients are needed for comparison
outside the data used in their computation, (1) reports be made as in
unselected 12-year-chronological-age groups; (2) that if the standard deviation
differs from 25 or 26 mental months (or its equivalent when such is possible) the
formula $\frac{\sigma}{\Sigma} = \frac{\sqrt{1-R}}{\sqrt{1-r}}$ (Kelley, 1923, stat.) be employed to estimate the correlation that
would be obtained in such groups; and (3) that if this estimate is not given, standard
deviations be reported so as to render the use of the formula and the interpretation
of correlations possible to readers.
The phrasing “determine the difference within a single grade”
suggests that not the best statistical convention was used, that of
reporting reliability correlations as of age groups rather than of grade
groups. There is still recourse, however, to the formula described by
Kelley¹ (1923, stat.) and Otis¹ (1922) by which any r’s obtained in
a group of which the standard deviation differs from 25 or 26 months
of mental or educational age can, by any who know the $\sigma$ of the group,
be translated into r’s that would be obtained in unselected age groups
like the Bloomsburg 116 12-year-olds² (Herring, 1924).
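The two forms of the range formula mentioned here are algebraically the same, as a short derivation shows. In this sketch $r$ and $\sigma$ are assumed to belong to the restricted group and $R$ and $\Sigma$ to the unselected group; the numerical case at the end is hypothetical and is not taken from the data.

```latex
\begin{align*}
\frac{\sigma}{\Sigma} &= \frac{\sqrt{1-R}}{\sqrt{1-r}}
  && \text{(form attributed to Kelley)} \\
\frac{\sigma^{2}}{\Sigma^{2}} &= \frac{1-R}{1-r}
  && \text{(squaring both sides)} \\
R &= 1 - \frac{\sigma^{2}}{\Sigma^{2}}\,(1-r)
  && \text{(form attributed to Otis)} \\[6pt]
\text{If } \frac{\sigma}{\Sigma} = \tfrac{1}{2} \text{ and } r = .60:\quad
R &= 1 - \tfrac{1}{4}(1 - .60) = .90
\end{align*}
```

A grade group with half the dispersion of an unselected age group therefore understates the unreliable fraction, 1 − r, by a factor of four.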

The amount of this correction for $r_{a_1a_2}$ may be estimated. Herring
has roughly estimated the average standard deviation of educational
ages for 21 school grades. These range roughly from 3 to 12 educational
months and average about 8. For 24 school grades similarly
estimated the standard deviation of mental ages ranges roughly from
4 to 20 mental months, and averages about 9. Let us suppose that
Chapman’s grade groups had standard deviations as high as 12 or 13
months. This is about ½ the standard deviation of a 12-year old
age group. Correcting by means of the formula, $\frac{\sigma}{\Sigma} = \frac{\sqrt{1-R}}{\sqrt{1-r}}$
(Kelley, 1923) Chapman’s first table on page 106 (1923) becomes


                                                             r
Intelligence with intelligence............................. .87
[label illegible].......................................... .90
[label illegible].......................................... .88
[label illegible].......................................... .94

For these data $r_{a_1a_2}$ = .27, for which k
is .96. k interprets r. These magnitudes are due, in the main, to
comparatively high correlation between schooling and achievement.

¹ Kelley gives $\frac{\sigma}{\Sigma} = \frac{\sqrt{1-R}}{\sqrt{1-r}}$. Otis gives $R = 1 - \frac{\sigma^{2}}{\Sigma^{2}}(1 - r)$.
The two formulas are identical.

² T. L. Kelley reports in a letter that he “found for unselected 12-year-olds
on the Stanford Achievement Test” “a standard deviation of” “two years, two
months.” Herring found the standard deviation of the Bloomsburg 116 12-year-olds
to be 25.17 mental months.

The following statements supported in part by data herewith
presented, gather about the Chapman formula:

1. The accomplishment difference exhibits high reliability when
the reliability of the examinations is very high. Such examinations
are not infrequently available. Accomplishment differences exhibit
high average or high reliability when the reliability of the examina- 
tions is high and the correlation between intelligence and schooling 
is .20 to .50—a not infrequent range. The terms high average, high, 
and very high are quantitatively defined in Table II. 
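
The range correction cited earlier (Kelley's form and Otis's equivalent form in the footnote) can be sketched in a few lines. The sketch assumes that r and σ belong to the narrower, selected group and R and Σ to the wider, unselected group. The input reliability of .50 is a made-up illustration, not a figure from Chapman's table, and the function names are hypothetical.

```python
import math

def otis_corrected_reliability(r, sd_ratio):
    """Reliability R expected in the wider group.

    r        : reliability observed in the narrower (selected) group
    sd_ratio : sigma / Sigma, narrow-group SD over wide-group SD
    Uses Otis's form R = 1 - (sigma/Sigma)^2 * (1 - r).
    """
    return 1.0 - sd_ratio ** 2 * (1.0 - r)

def kelley_sd_ratio(r, R):
    """Kelley's form: sigma / Sigma = sqrt(1 - R) / sqrt(1 - r)."""
    return math.sqrt(1.0 - R) / math.sqrt(1.0 - r)

# Hypothetical input: r = .50 in a grade group whose SD is about
# half that of an unselected 12-year-old group.
R = otis_corrected_reliability(0.50, 0.5)   # 1 - .25 * .50 = .875
ratio_back = kelley_sd_ratio(0.50, R)       # recovers the ratio of .5
```

Feeding the corrected R back into Kelley's form returns the original ratio of standard deviations, which is one way of seeing that the two formulas are the same relation solved for different quantities.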

The accomplishment difference is by no means done for on the 
count of unreliability. This statement is based upon known reli- 
abilities of certain examinations, upon Chapman’s Formula and upon 
nothing else. These two bases seem sufficient for the conclusion. 

With such instruments as the Binet Tests and the Stanford 
Achievement Tests we may often expect reliability represented by 
$r_{a_1 a_2}$ = .90 and up.

2. A new epoch in the reliability of group testing may prove to 
have appeared by reason of the publication of the Stanford Achieve- 
ment Tests. 

3. Tests which appear without announcement at least of coeffi- 
cients of reliability obtained in groups whose standard deviation is 
given, are properly viewed with suspicion. Reliability—never to be 
assumed—cannot be safely divined by inspection by the most sagacious 
statistical wizard. Even the good author and the good book company 
are not sufficient criteria for the selection of tests! No such criteria
will do instead of $r_{1I} = \frac{2r}{1 + r}$ (Kelley, 1923, Formula [158]) in
unselected age groups, and other criteria (Franzen, 1921, 1922 and
1924). It is unfortunately sometimes true that persons of even 
national repute have selected wrong tests for specific purposes because 
they have used wrong criteria or failed to use all criteria. So far as 
reliability is concerned we must have, other things being equal, the 
highest available. For the purpose of selecting tests, reliability is in 
general second in importance only to validity. 

4. The reliability of some educational tests in common use is from 
.50 to .75, and this should be called too low for use. Good work has 
for the last few years frequently meant reliabilities of .90 and up with 
group tests of both intelligence and achievement. It is quite feasible 
for many purposes in common vogue to work with instruments which 
yield satisfactorily reliable accomplishment differences. 

5. The lower the correlation between intelligence and achievement, 
the more reliable the accomplishment differences. The higher the 
correlation, the smaller the differences between intelligence and
achievement, a condition in which effective learning is strongly
suggested and in which we become less interested in determining the
existence and magnitude of differences.

By no means may $r_{is}$ be estimated at some consistent and typical
amount for various school groups; it ranges widely, usually somewhere
between .10 and .80, probably varying roughly with the effectiveness
of educative stimuli. Franzen (1922) assumes unity as the limit of
this range, which some day scientific control may make our
$r_{is}$’s approach.
6. Popular thinking sometimes seems to suppose that r = .99
means correlation which is 99 per cent perfect for all purposes including
prediction. The purpose is very frequently prediction; validity,
reliability, objectivity, and regression are commonly cases of
prediction. The truth is that in these extremely common uses of
correlation, 99 per cent perfect correlation must be expressed by
r = .99995, an infrequent luxury! r = .99 is only 86 per cent of perfect
correlation for the purposes of prediction.
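
The arithmetic behind these figures can be checked with a short sketch. It assumes that k is the coefficient of alienation, k = sqrt(1 - r²), an assumption that agrees with the pairs of values quoted in this paper (r = .99 with k = .14, and k = .01 with r = .99995). The function names are illustrative only.

```python
import math

def alienation(r):
    """Coefficient of alienation k for a correlation r (assumed k = sqrt(1 - r^2))."""
    return math.sqrt(1.0 - r * r)

def correlation_for(k):
    """Correlation r that yields a given k (inverse of the relation above)."""
    return math.sqrt(1.0 - k * k)

k_99 = alienation(0.99)           # about .14
gain_99 = 1.0 - k_99              # about .86, the "86 per cent" of the text
r_needed = correlation_for(0.01)  # about .99995
```

On this reading, "per cent of perfect correlation for the purposes of prediction" is 1 - k, so a correlation that looks nearly perfect still leaves about 14 per cent of the error of a pure guess.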

A moral from Chapman’s Formula is the desirability of using 
highly reliable tests, a moral which at present in measuring intelligence 
almost always points toward the use of the Binet-Simon Tests as 
against group tests whenever individual estimates are to be made 
and whenever this policy is feasible. The interpretation, however, 
which even competent critics have sometimes made, is that the 
accomplishment difference is itself to be abandoned. Popular confidence
was perhaps shaken with regard to accomplishment differences
rather than with regard to unreliable tests. The emphasis in Chap- 
man’s argument is properly directed against accomplishment differ- 
ences only when they are based on unreliable measures and not at 
all against accomplishment differences as such. The formula affords 
needed stimulus for selecting highly reliable instruments of measure- 
ment. In view of the ease of determining reliability there seems to 
be little excuse for failing to study the correlation between random 
halves or between two forms the first time a new test is administered 
and for failing to report the result of the study the first time the 
test is published or otherwise publicly reported. 
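
As an illustration of how little work such a study requires, the sketch below correlates random halves of a test and then steps the half-test correlation up to full length. It assumes the standard Spearman-Brown relation 2r/(1 + r) for the step-up. The item scores and the function names are invented for the example.

```python
import math
import random

def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2.0 * r_half / (1.0 + r_half)

def split_half_reliability(item_scores, seed=0):
    """item_scores: one list of item marks (0 or 1) per pupil."""
    n_items = len(item_scores[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)        # random halves
    first, second = order[: n_items // 2], order[n_items // 2:]
    half_a = [sum(p[i] for i in first) for p in item_scores]
    half_b = [sum(p[i] for i in second) for p in item_scores]
    return spearman_brown(pearson(half_a, half_b))

# Invented data: five pupils, six items each.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = split_half_reliability(scores)
```

With real data the same few lines would be run on the full item-by-pupil record of the first administration, and the resulting coefficient reported together with the standard deviation of the group.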

The desirability of using valid tests is also emphasized in the
formula. If a test supposedly of intelligence but really, in some
degree, of school achievement, is correlated with an achievement
test, there results a spuriously high $r_{is}$ and a spuriously low $r_{a_1 a_2}$.
The accomplishment differences then appear less reliable than they
really are, and the teaching appears more effective than it really is.

1 When k = .01, r = .99995.
When r = .99, k = .14.

The chief outcome is not merely that the accomplishment difference 
is for the time saved. That is important enough, but of more inclusive 
significance is the understanding that if we are ever to cease providing 
the negative critics of measurement with obvious points of onslaught, 
we must, for one thing, know that a reliability of r = about .95 and 
k = about .31 is often both feasible and necessary for good work. 
Fortunately, careless field work brings disrepute. 

In view of current frequency of disregard of validity and reli- 
ability coefficients in selecting tests for routine use, Chapman has done 
a great service, for his formula is influencing examiners toward higher 
standards in this particular. His service is the greater in view of his 
having made generally available a means for determining amounts of 
error involved in work with accomplishment differences. 
INTELLIGENCE RE-DEFINED


The Nature of Intelligence, by L. L. Thurstone. New York: Har- 
court, Brace, 1924. Pp. 167. 


This volume starts out with a critical discussion of the stimulus- 
response formula and the whole system of psychological interpreta- 
tions which reduces mental activity to this mechanistic basis. The 
author sees the possibility of a synthesis in which conflicting schools 
of psychological thought can be harmonized and proceeds to the 
presentation of his thesis. 

He says that every scientific problem is a search for the functional 
relation between two or more variables and then examines the variables 
which various approaches to experimental psychology recognize. 
In this connection he cites the following typical experiment: 

“We place before a subject a tachistoscope and he sees nonsense 
syllables. He tries over and over again until he has learned them. 
Out of this scientific experiment comes the scientific deduction that, 
other things being equal, he remembers best those syllables that he 
saw first, before the novelty wore off, and those which are at the end 
of the list . . . This is a scientific experiment in which we state 
the relationship between two variables. The answers of the subject 
are described as a function of the stimulating nonsense. But how 
about the incentives? The most important factor is whether or not 
the subject cares about the nonsense syllables. This factor of interest 
and effort overshadows entirely the small effects of the arrangement of 
the syllables. The experiment is scientifically quite legitimate but it 
is trivial in respect to the factors that are most important for mental 
life. 

“We recognize, of course, this fact: That incentives are more
important than the arrangement of syllables in the page in predicting
the recall. But since the incentives are not readily measured, we rest
content with describing the relations that we can measure. Well and
good. This would not be subject to criticism if it were not for the
fact that we have come to forget the individual person altogether.
Experiments of this type have come to be the rule and we have taken
for granted that psychology is primarily concerned with the incidental
relation between the response and the stimulus . . . and have
forgotten the person who may or may not want to do the responding.

“I suggest that we dethrone the stimulus. He is only nominally
the ruler of psychology. The real ruler of the domain which psychology
studies is the individual and his motives, desires, wants, ambitions,
cravings, aspirations . . . The psychological act which is the central
subject matter of psychology becomes then the course of events,
primarily mental, which intervenes between the motive and the
successful neutralization or satisfaction of that motive.”

1 Unsigned reviews were prepared by L. Z.

The various implications of this point of view are developed in the 
ten intervening chapters and the final chapter re-defines intelligence 
on various levels. The author maintains that “biologically the higher 
thought processes serve the same purpose for the organism as the 
simplest anatomical differentiation of the exploring function.” Intelli- 
gence is the ability to preconceive the effects of activity without carry- 
ing the act to completion, representing and dealing with experience in 
terms of cues and symbols, thus eliminating inappropriate and 
selecting appropriate reactions without actual overt trial and error. 





THE THEORY AND PRACTICE OF AN EXPERIMENTAL SCHOOL


Experimental Practice in the City and Country School, by Caroline Pratt 
and Lula E. Wright. New York: Dutton, 1924. Pp. VIII + 302. 


This is the second volume of the records of groups in this experimental
school. Miss Stotts’ “Record of Group Six” was reviewed
in this Journal in 1921. Miss Wright’s record is an improvement on 
the earlier form; and from it readers may gain a clear idea of the actual 
processes which express the theory which the school strives to incorpo- 
rate and exemplify. 

Miss Pratt’s introduction is too full of feeling and argument to 
satisfy those who prefer an impartial presentation of assumptions 
and hypotheses and evidence upon which to base their evaluations. 
Most of the theory is acceptable enough to stand without the support 
of exaggerated claims. Much of the practice in the special subjects 
still needs to be permeated with the philosophy of the experiment and 
is not of one piece with Miss Pratt’s theory or Miss Wright’s philosophy
of values.

The philosophy itself has not arrived at that calm stage of self- 
evaluation which is reflected in a well organized presentation and 
appeal to the reader’s intelligent but impartial appraisal. But the 
vividness of the record, the worth-whileness of the experiment, and 
the evident sincerity of the whole undertaking, make one patient with 
a defect of presentation which may be due to an excess of zeal on the 
part of the person who has identified herself with her work. 





CONCRETE APPLICATIONS OF THE LAWS OF LEARNING


Psychological Principles Applied to Teaching, by W. H. Pyle. Baltimore:
Warwick and York, 1924. Pp. VI + 197.

Psychologists seem to be awaking to the fact that psychology
can be applied in the presentation of psychological principles. This
author prefaces his unique manual with the frank admission
that general theory courses are practically valueless to teachers.
Nevertheless he believes that the application of psychology in the
solution of classroom problems is “the one hope for the science of
teaching.”

He has, therefore, assembled 115 specific statements of psycholog- 
ical principles or laws, and listed under each of these a number of 
pertinent concrete instances of their application. The instances or 
applications are selected for their illustrative value. 

The organization is novel. A principle is concisely stated in bold 
face type. The student is referred to the page in “The Psychology of
Learning,” an earlier book by the same author, for a systematic
discussion of the principle. There follow three or four or more
illustrative applications from as many fields or subjects. All of this
material is placed on the left hand pages. The right hand pages are left
blank for “Teacher’s Notes.” There is a subject index and a general index.
By using the former it is possible to study the applications of psychol- 
ogy to any particular school subject. For instance, the teacher of 
foreign languages will find that the book illustrates 14 applications of 
psychological principles to his particular field. Thirty-two other 
subjects of instruction are similarly treated. 

The manual should be exceedingly helpful as a means of making 
principles meaningful. In criticism it should be said that the order of 
contact is still not psychological. Principles should grow out of 
concrete situations. Principles or generalizations are the general
aspects common to a series of particular situations. Thus the best
assurance of a real grasp and realization of principles is the experience 
of deriving or abstracting them from concrete unclassified experiences. 

It is gratifying to note that psychologists who teach students of
education are beginning to practice something of what they preach.
But there is still room for improvement.





THE KNOWLEDGE EQUIVALENT OF MENTAL AGE 


An Inventory of the Minds of Individuals of Six and Seven Years Mental 
Age, by Grace A. Taylor. Teachers College, Columbia Univer- 
sity Contributions to Education No. 134, 1923. Pp. 147. 


This book is a report of a study of 512 children of the mental ages 
of 6 years, 0 months to 7 years, 11 months, with the objective of finding 
out what children of these mental ages know irrespective of chrono- 
logical age, intelligence quotient or school knowledge. The subjects 
were 312 children from a school for feeble-minded children, and 200 
children from a public school of New York City (normal group), 
with an additional 40 children (superior group) who were given parts 
of the inventory. The chronological ages of the children ranged from 
4 years to 20 years. 

Preliminary selection of the children was on the basis of the 
Stanford Revision of the Binet Test, supplemented by the Herring 
Revision. For greater reliability in the determination of mental age, 
a battery consisting of Army Alpha, Pintner Mental Survey, Pressey 
Primer and Thorndike Non-verbal Tests was used. 

The inventory included 230 items, chiefly questions on such topics 
as personal information, general information, knowledge of parts, 
life situations, vocabulary, memory, etc. 

From the point of view of the ordinary classroom teacher it is to 
be regretted that so large a proportion of the subjects in the study were 
subnormal. The author finds, however, that six and seven years 
mental age is about the same for the various chronological ages, 
although there are some tests which older children pass because of 
their greater maturity. Where the results are analyzed on the basis of 
intelligence quotients, the tables are rather difficult to interpret because 
of the varying number of individuals in the different groups, but the 
author has summed up her conclusions in convenient form in the final 
chapter. 


BETH WELLMAN.













VOCATIONAL GUIDANCE TESTS 


Tests for Vocational Guidance of Children Thirteen to Sixteen, by
Herbert A. Toops. Teachers College Contributions to Education,
No. 36, 1923. Teachers College, New York. Pp. XII + 159.

This volume is the report of the results of the work done by the
Institute of Educational Research of Teachers College, Columbia
University, to provide tests for use in the vocational guidance of children
in their early teens, the assumption being made that vocational guidance
is a function of the public elementary school. “The particular problem
was to select or devise tests (1) that would be of value in predicting
fitness for various careers; (2) that could be given (a) to children in
fairly large groups (b) by any intelligent teacher or social worker who
would give a reasonable amount of time to training for the work, and
(c) within a time limit of three hours; and (3) that could be prepared
and scored cheaply.”

Corresponding roughly to three of the trunk lines of vocational 
activities which a 15-year-old may enter (school, trade, office), three 
types of tests were investigated: (I) tests of ability to deal with ideas
and symbols, as represented by the Thorndike Arithmetical Problem-solving
Tests and the Thorndike-McCall Reading Test; (II) boys’ and girls’
tests of ability to deal with things and mechanisms (Stenquist
Assembly Tests and Mechanical Aptitude Tests I and II,
Thurstone Manual Training Information Test, Army General Trade 
Test, M. I. T. Test and the I. E. R. Assembly Test for Girls); (III) 
tests of ability to deal with clerical items and procedures (I. E. R. 
General Clerical Scale, C-1, and I. E. R. Routine Clerical Test, C-2). 

The tests were given in a number of public schools of New York 
City, business colleges, and army trade and business schools. Further- 
more, their validity, within the limits of the present investigation, was 
studied in connection with adult occupational workers in corresponding 
fields. The Vocational Guidance Tests finally chosen were: (I) The
I.E.R. Arithmetic Reading Test (same as above), or any of the standard
tests of general intelligence; (II) for boys, the Stenquist Assembly
Test; for girls, the I.E.R. Assembly Test; (III) I.E.R. Clerical C-1
and I.E.R. C-2.

The investigators, in view of the conditions under which vocational 
guidance is now given, prepared the tests for a three-hour period, but 
their studies show that a much longer time is desirable. Notwith- 
standing the fact that the results of the extended study of the two 
clerical tests are not clear, these tests afford at least a safer prognosis 
of deficiency in ability for clerical work than can be secured on the
basis of intelligence tests alone. 

One of the incidental but outstanding contributions of the investi- 
gation is the discovery of a multiple ratio correlation technique for a 
rapid and systematic method of weighting tests, as contrasted with 
the hitherto laborious partial regression technique, thus making it 
possible to cut down very decidedly on the number of intercorrelations 
to be solved in the selection of the best tests from the total number. 
Of special interest to students seeking mastery in this field are the 
findings relative to constructional technique and underlying
implications.

J. KUDERNA. 